Scoring Documents By The Closest One With Multiple kNN Fields
Elasticsearch is more than just a lexical (textual) search engine. It is a versatile search engine that supports k-Nearest Neighbors (kNN) search as well as semantic search, in addition to traditional textual matching.
kNN search in Elasticsearch is primarily used for finding the "nearest neighbors" of a given point in a multi-dimensional space. Documents are represented as arrays of numbers (vectors), and when searched, the kNN feature fetches the documents whose vectors are closest to the query vector. kNN search is commonly applied in scenarios involving vectors created from text, images or audio through a process called "embedding", which uses deep neural networks.
Semantic search, on the other hand, is powered by natural language processing: it finds relevant results based on intent and meaning rather than just textual matches.
In this article, our focus is on searching through documents with multiple kNN fields, and scoring the resulting documents based on those kNN vector fields.
Before that, let's take a couple of minutes to understand the mechanics of kNN.
kNN Mechanics
kNN (k-nearest neighbors) search fetches the k documents nearest to the user's query, as measured by a distance algorithm.
It works by calculating the distance - usually Euclidean distance or cosine similarity - between vectors. When we query a dataset using kNN, Elasticsearch finds the top k entries that are closest to our query vector.
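The idea can be sketched outside Elasticsearch in a few lines: given a query vector, rank a set of document vectors by cosine similarity and keep the top k. The vectors and the value of k below are made up purely for illustration.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the
    # product of their magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn(query, docs, k):
    # Score every document vector against the query and keep the k
    # most similar document ids, highest similarity first
    ranked = sorted(docs.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

docs = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],
    "doc3": [0.0, 1.0, 0.0],
}
print(knn([1.0, 0.05, 0.0], docs, k=2))  # doc1 and doc2 are closest
```

This is effectively what the brute-force flavor of kNN does; the approximate flavor trades a little accuracy for speed by not scoring every document.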
Before performing search activities on the data to fetch results, the index must be primed with appropriate embeddings - embedding is a fancy name for vectorized data. These fields are of type dense_vector, holding numerical data.
Let's take an example:
If you have an image dataset and you've converted these images into vectors using a neural network, you can use kNN search to find images most similar to your query image. If you provide the vector representation of a "pizza" image, kNN can help you find other images that are visually similar, such as pancakes and perhaps pasta :)
kNN search is about finding the nearest data points in a vector space, making it suitable for similarity searches over text or image embeddings. In contrast, semantic search is about understanding the meaning and context of words in a query, making it powerful for text-based searches where intent and context matter.
Scoring Documents
Scoring documents based on the closest document when you have multiple k-nearest neighbor (kNN) fields involves leveraging Elasticsearch's ability to handle vector similarity to rank documents. This approach is particularly beneficial in scenarios such as semantic search and recommendation engines, or in cases where we are dealing with multi-dimensional data and need to find the "closest" or most similar items based on multiple aspects (fields).
Text Embedding and Vector Fields
Let's take the example of a movies index that consists of a few fields such as title, synopsis and others. We represent them using common data types, like the text data type. In addition to these normal fields, we create two more fields: title_vector and synopsis_vector - as the names indicate, they are dense_vector data type fields. That means the data will be vectorized using a process called "text embedding".
The embedding model is a natural language processing neural network that converts its inputs into an array of numbers. The vectorized data is then stored in the dense_vector fields. A document can have multiple fields, including a few dense_vector fields to store vector data.
So, in the following section, we'll create the index with a mix of normal and kNN fields.
Creating an Index with kNN Fields
Let's create an index called movies
that holds sample movie documents. Our documents will have multiple fields, including a couple of kNN fields to store the vector data. The following snippet demonstrates the index mapping code:
PUT /movies
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"title_vector.predicted_value": {
"type": "dense_vector",
"dims": 384
},
"synopsis": {
"type": "text"
},
"synopsis_vector.predicted_value": {
"type": "dense_vector",
"dims": 384
},
"genre": {
"type": "text"
}
}
}
}
The notable thing is that the title field, which is of type text, has an equivalent vector field: title_vector.predicted_value. Similarly, the vector field for synopsis is synopsis_vector.predicted_value. Also, the dense vector fields have their dimension (384) declared in the above code as dims. This indicates the model will produce 384 dimensions for each ingested field. The maximum number of dimensions we can request on a dense_vector field is 2048.
Executing this script creates a new index named movies with two vector fields: title_vector.predicted_value and synopsis_vector.predicted_value.
Indexing sample docs
Now that we have an index, we can index some movies and search them. In addition to the title and synopsis fields, each document will also have vector fields. Before we index the documents, we need to fill in the respective vectors. The following code shows a sample movie document after the vectors have been generated:
POST /movies/_doc/1
{
"title": "The Godfather",
"title_vector.predicted_value": [0.1, 0.5, 3, 4,...], // vectorized data
"synopsis": "The aging patriarch of an organized crime dynasty....",
"synopsis_vector.predicted_value": [0.2, 0.6, 1, 0.7,...] // vectorized data
}
As you can see, the vector data needs to be prepared before the document gets ingested. There are a couple of ways to do this:
- calling the inference API on the text_embedding model outside of Elasticsearch to get the data vectorized, as shown above (I've mentioned it here as a reference, though we'd rather use an inference processor pipeline), and
- setting up and using an inference pipeline.
Setting up an inference processor
We can set up an ingest pipeline that applies the embedding function to the relevant fields to produce the vectorized data. For example, the following code creates the movie_embedding_pipeline pipeline, which generates the embeddings for each field and adds them to the document:
PUT _ingest/pipeline/movie_embedding_pipeline
{
"processors": [
{
"inference": {
"model_id": ".multilingual-e5-small",
"target_field": "title_vector",
"field_map": { "title": "text_field" }
}
},
{
"inference": {
"model_id": ".multilingual-e5-small",
"target_field": "synopsis_vector",
"field_map": { "synopsis": "text_field" }
}
}
]
}
The ingest pipeline may require a bit of explanation:
- The two target fields - title_vector and synopsis_vector - are the dense_vector field types. Hence, they store the vectorized data produced by the multilingual-e5-small embedding model.
- The field_map declares which document field (title and synopsis in this case) gets mapped to the model's text_field input field.
- The model_id declares the embedding model used to embed the data.
- The target_field is the name of the field where the vectorized data will be written.
Executing the above code creates the movie_embedding_pipeline ingest pipeline. That is, a document with just title and synopsis will be enhanced with additional fields (title_vector and synopsis_vector) containing the vectorized version of the content.
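Assuming the embedding model is deployed and started, we can dry-run the pipeline with the _simulate endpoint to verify that the vector fields get attached, before indexing anything for real:

```
POST _ingest/pipeline/movie_embedding_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "The Godfather",
        "synopsis": "The aging patriarch of an organized crime dynasty...."
      }
    }
  ]
}
```

The response echoes each document back with the title_vector and synopsis_vector fields populated, without writing anything to the index.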
Indexing the Documents
The movie document consists of the title and synopsis fields as expected, so we can index it as shown below. Note that the document is enhanced by the pipeline processor, enabled via the pipeline parameter in the URL. The following code snippet shows indexing a handful of movies:
POST movies/_doc/?pipeline=movie_embedding_pipeline
{
"title": "The Godfather",
"synopsis": "The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son."
}
POST movies/_doc/?pipeline=movie_embedding_pipeline
{
"title": "Avatar",
"synopsis": "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his orders and protecting the world he feels is his home."
}
POST movies/_doc/?pipeline=movie_embedding_pipeline
{
"title": "Godzilla",
"synopsis": "The world is beset by the appearance of monstrous creatures, but one of them may be the only one who can save humanity."
}
POST movies/_doc/?pipeline=movie_embedding_pipeline
{
"title": "The Good, The Bad and The Ugly",
"synopsis": "A bounty hunting scam joins two men in an uneasy alliance against a third in a race to find a fortune in gold buried in a remote cemetery."
}
POST movies/_doc/?pipeline=movie_embedding_pipeline
{
"title": "A Few Good Men",
"synopsis": "Military lawyer Lieutenant Daniel Kaffee defends Marines accused of murder. They contend they were acting under orders."
}
We can of course use the _bulk API to index the documents in one go - do check out the Bulk API documentation for further details.
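For reference, a bulk version of the same ingestion might look like the following - the pipeline parameter applies the movie_embedding_pipeline to every document in the request:

```
POST movies/_bulk?pipeline=movie_embedding_pipeline
{ "index": {} }
{ "title": "The Godfather", "synopsis": "The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son." }
{ "index": {} }
{ "title": "Avatar", "synopsis": "A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his orders and protecting the world he feels is his home." }
```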
Once these documents are indexed, you can fetch the movies to check whether the vectorized content was added by executing a search query:
GET movies/_search
This returns the movies with two additional fields containing the vectorized content, as shown in the image below:
Now that we have indexed the documents, let's jump into searching through them using the kNN search feature.
kNN Search
The k-Nearest Neighbors search in Elasticsearch fetches vectors (documents) that are closest to the given (query) vector. Elasticsearch supports two types of kNN search:
- Approximate kNN search
- Brute Force (or Exact) kNN Search
While both searches produce results, brute force finds exact results at the cost of higher resource utilization and query time. Approximate kNN is good enough for the majority of search cases, as it offers near-accurate results.
Elasticsearch provides the knn query for approximate search, while the script_score query should be used for exact kNN search.
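For completeness, an exact (brute force) search can be sketched with a script_score query over the same vector field. Note that the query_vector below is a placeholder: a real request would need the full 384-dimensional embedding of the query text, produced outside the query.

```
GET movies/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'title_vector.predicted_value') + 1.0",
        "params": {
          "query_vector": [0.1, 0.5, ...]
        }
      }
    }
  }
}
```

Adding 1.0 to the cosine similarity keeps the score non-negative, since Elasticsearch does not allow negative scores.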
Approximate Search
Let's run an approximate search on the movies as shown below. Elasticsearch provides a knn query with a query_vector_builder block, which consists of our query requirements. Let's write the code snippet first and discuss its constituents afterwards:
GET movies/_search
{
"knn": {
"field": "title_vector.predicted_value",
"query_vector_builder": {
"text_embedding": {
"model_id": ".multilingual-e5-small",
"model_text": "Good"
}
},
"k": 3,
"num_candidates": 100
},
"_source": [
"id",
"title"
]
}
Traditional search queries use the query function; however, Elasticsearch introduced the knn search function as a first-class citizen for querying vectors.
The knn block consists of the field we are searching against - in this instance, the title vector, i.e., the title_vector.predicted_value field. Remember, this is the name of the field we declared in the mapping earlier.
The query_vector_builder is where we provide our query, along with the model to use to embed it. In this case, we set multilingual-e5-small as our model, and the text is simply "Good". The query will be vectorized by Elasticsearch using the text embedding model (multilingual-e5-small), and the resulting query vector is then compared against the available title vectors.
The k value indicates how many documents need to be brought back as a result, while num_candidates defines how many candidate documents are considered on each shard before the top k are picked.
This query should get us the top three documents:
"hits": [
{
"_index": "movies",
"_id": "ZADvgo4BDf-WoG_MTka1",
"_score": 0.92932993,
"_source": {
"title": "The Good, The Bad and The Ugly"
}
},
{
"_index": "movies",
"_id": "uJ3wgo4BMlgFmHKKtFSp",
"_score": 0.91828954,
"_source": {
"title": "A Few Good Men"
}
},
{
"_index": "movies",
"_id": "tp15fY4BMlgFmHKK6VRV",
"_score": 0.90952975,
"_source": {
"title": "The Godfather"
}
}
]
The top-scored movie was "The Good, The Bad and the Ugly" when we searched for "Good" against the titles. Do note that kNN search always yields results, even if the resulting movies are not a strong match - an inherent characteristic of kNN matching.
Take note of the relevancy score (_score) for each of the documents - as expected, the documents are sorted by this score.
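As an aside on where these numbers come from: for dense_vector fields indexed with cosine similarity (the default), Elasticsearch derives the kNN _score from the raw cosine value, normalized into (0, 1] as (1 + cosine) / 2 - which is why the scores above all sit just below 1. A small sketch with made-up two-dimensional vectors (the real fields are 384-dimensional):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def knn_score(query_vec, doc_vec):
    # Elasticsearch's score normalization for cosine similarity:
    # maps cosine's [-1, 1] range into (0, 1]
    return (1 + cosine(query_vec, doc_vec)) / 2

print(round(knn_score([0.6, 0.8], [0.8, 0.6]), 4))  # close vectors score near 1
```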
Searching for Multiple kNN fields
We have two vector fields in the movie document - title_vector and synopsis_vector - so we can search against both fields and expect the resulting documents to be ranked on the combined scores.
Let's say we want to search for "Good" in the title but "orders" in the synopsis field. Remember, the previous single-field search on the title using "Good" retrieved "The Good, The Bad and the Ugly". Let's see which movie will be fetched with the "orders" part of the synopsis added to our search.
The following code declares our multi-kNN field search:
POST movies/_search
{
"knn":[
{
"field": "title_vector.predicted_value",
"query_vector_builder": {
"text_embedding": {
"model_id": ".multilingual-e5-small",
"model_text": "Good"
}
},
"k": 3,
"num_candidates": 100
},
{
"field": "synopsis_vector.predicted_value",
"query_vector_builder": {
"text_embedding": {
"model_id": ".multilingual-e5-small",
"model_text": "orders"
}
},
"k": 3,
"num_candidates": 100
}
]
}
As you can imagine, the knn query can accept multiple search fields as an array - here we provided search criteria for both fields. The answer is "A Few Good Men", as it is the movie whose synopsis vector is closest to the "orders" vector.
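Roughly speaking, when multiple knn clauses are given, each clause scores the document against its own field's vector and the document's overall score is the sum of the per-clause scores. The per-field scores below are invented for illustration, but they show how a strong synopsis match can lift a document past one with the better title match:

```python
# Hypothetical per-field kNN scores for two candidate movies
per_field_scores = {
    "A Few Good Men":                 {"title_vector": 0.918, "synopsis_vector": 0.931},
    "The Good, The Bad and The Ugly": {"title_vector": 0.929, "synopsis_vector": 0.895},
}

def combined_score(scores):
    # Sum the scores contributed by each knn clause
    return sum(scores.values())

ranked = sorted(per_field_scores,
                key=lambda title: combined_score(per_field_scores[title]),
                reverse=True)
print(ranked[0])  # "A Few Good Men" wins on the combined score
```

Each knn clause also accepts a boost, which weights its contribution to the sum if one field should matter more than the other.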
When do we search with multi-kNN fields
There are a few instances where we might search using multiple kNN fields:
- Searching for tweets based on image similarity (a visual kNN field) and tweet text similarity (a text kNN field).
- Recommending similar songs based on both audio features such as tempo and rhythm (an audio kNN field) and title/artist/genre information (a text kNN field).
- Recommending movies or products based on the user's behavior (a kNN field for user interactions) and movie/product attributes (a kNN field based on those attributes).
That's a wrap. In this article, we looked at the mechanics of kNN search and at how to find the closest documents when we have multiple vectorized fields.