Sparse vectors are representations in a high-dimensional space, where only a small number of dimensions have non-zero values.
For example, for a given text, a dense vector representation with the BGE-M3 model has 1024 dimensions, all of them non-zero. A sparse vector representation of the same text typically has fewer than a hundred non-zero valued dimensions, out of a vector space with more than 250,000 dimensions. Also, unlike dense representations, the number of non-zero valued dimensions in a sparse vector varies with the text.
Generally, sparse vectors can be represented with two arrays of equal sizes:

- An array of signed 32-bit integers for the indices of the non-zero valued dimensions.
- An array of 32-bit floats for the values at those dimensions.
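As a quick illustration in plain Python (the indices and values here are made up):

```python
# A sparse vector over a huge dimensional space, stored as two parallel arrays.
# Only the non-zero dimensions are kept; everything else is implicitly zero.
indices = [23, 401, 157_809]  # non-zero dimension indices (signed 32-bit integers)
values = [0.42, 1.13, 0.87]   # values at those dimensions (32-bit floats)

# Conceptually, this represents a vector v where
#   v[23] = 0.42, v[401] = 1.13, v[157_809] = 0.87, and v[i] = 0.0 elsewhere.
```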
Unlike dense vectors, which excel at approximate semantic matching, sparse vectors are particularly useful for tasks that require exact or near-exact matching of tokens, words, or features, such as keyword-based and full-text search.
There are various ways to create sparse vectors. You can use BM25 for information retrieval tasks, or use models like SPLADE that enhance documents and queries with term weighting and expansion.
Upstash gives you full control by allowing you to upsert and query sparse vectors.
Also, to make embedding easier for you, Upstash provides some hosted models and allows you to upsert and query text data. Behind the scenes, the text data is converted to sparse vectors.
You can create your index with a sparse embedding model to use this feature.
BGE-M3 is a multi-functional, multi-lingual, and multi-granular model widely used for dense indexes.
We also provide BGE-M3 as a sparse vector embedder, which outputs sparse vectors from a 250,002-dimensional space. In these sparse vectors, each token is weighted according to the input text, which enhances traditional sparse representations with contextual information.
BM25 is a popular algorithm used in full-text search systems to rank documents based on their relevance to a query.
This algorithm relies on key principles of term frequency, inverse document frequency, and document length normalization, making it well-suited for text retrieval tasks.
Upstash provides a general-purpose BM25 implementation that applies to documents and queries in English. It tokenizes the text into words, removes stop words, stems the remaining words, and assigns each a weight based on the BM25 formula:
$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avg}(|D|)}\right)}$$

Where:

- f(qᵢ, D) is the frequency of term qᵢ in document D.
- |D| is the length of document D.
- avg(|D|) is the average document length in the collection.
- k₁ is the term frequency saturation parameter.
- b is the length normalization parameter.
- IDF(qᵢ) is the inverse document frequency of term qᵢ.
To make it a general-purpose model, we had to decide on some of the constants mentioned above, which differ from implementation to implementation. We decided to use the following values:

- k₁ = 1.2, a widely used value in the absence of advanced optimizations.
- b = 0.75, a widely used value in the absence of advanced optimizations.
- avg(|D|) = 32, chosen by tokenizing the MSMARCO dataset, taking the average document length, and rounding it to the nearest power of two.

In the future, we might provide support for more languages and the ability to set different values for the above constants.
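To make the formula concrete, here is a plain-Python sketch of how these constants enter the term weight. This is illustrative only, not Upstash's actual implementation:

```python
import math

K1 = 1.2            # term frequency saturation parameter
B = 0.75            # length normalization parameter
AVG_DOC_LEN = 32.0  # average document length used by the general-purpose model

def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency, using the BM25 IDF formula given below."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)

def bm25_term_weight(term_freq: int, doc_len: int,
                     total_docs: int, docs_with_term: int) -> float:
    """Weight of a single term q_i in document D, following the formula above."""
    length_norm = K1 * (1 - B + B * doc_len / AVG_DOC_LEN)
    return idf(total_docs, docs_with_term) * term_freq * (K1 + 1) / (term_freq + length_norm)

# A term that appears twice in a 40-token document, in a collection of
# 1,000 documents where 10 documents contain the term:
print(bm25_term_weight(term_freq=2, doc_len=40, total_docs=1000, docs_with_term=10))
```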
As for the inverse document frequency IDF(qᵢ), we maintain that information per token in the vector database itself. You can use it by providing IDF as the weighting strategy for your queries, so that you don't have to compute the weights yourself.
You can upsert sparse vectors into Upstash Vector indexes in two different ways.
You can upsert sparse vectors by representing them as two arrays of equal sizes: one signed 32-bit integer array for the non-zero dimension indices, and one 32-bit float array for the corresponding values.
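For example, using the Python SDK (a minimal sketch assuming the upstash-vector package and an existing sparse index; replace the URL and token placeholders with your own):

```python
from upstash_vector import Index, Vector
from upstash_vector.types import SparseVector

index = Index(
    url="UPSTASH_VECTOR_REST_URL",
    token="UPSTASH_VECTOR_REST_TOKEN",
)

index.upsert(
    vectors=[
        Vector(
            id="id-0",
            # Non-zero dimension indices and their values, as two equal-sized arrays.
            sparse_vector=SparseVector(indices=[17, 42, 1023], values=[0.1, 0.25, 0.9]),
        ),
    ]
)
```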
Note that we do not allow sparse vectors to have more than 1,000 non-zero valued dimensions.
If you created the sparse index with an Upstash-hosted sparse embedding model, you can upsert text data, and Upstash can embed it behind the scenes.
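For example, with the Python SDK (a sketch assuming a recent upstash-vector version where Vector accepts raw text via data):

```python
from upstash_vector import Index, Vector

index = Index(
    url="UPSTASH_VECTOR_REST_URL",
    token="UPSTASH_VECTOR_REST_TOKEN",
)

# The text is embedded into a sparse vector by the hosted model on Upstash's side.
index.upsert(
    vectors=[
        Vector(id="id-0", data="Upstash Vector provides sparse embedding models."),
    ]
)
```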
Similar to upserts, you can query sparse vectors in two different ways.
You can query sparse vectors by representing the sparse query vector as two arrays of equal sizes: one signed 32-bit integer array for the non-zero dimension indices, and one 32-bit float array for the corresponding values.
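For example, with the Python SDK (same assumptions as the upsert sketch above):

```python
from upstash_vector import Index
from upstash_vector.types import SparseVector

index = Index(
    url="UPSTASH_VECTOR_REST_URL",
    token="UPSTASH_VECTOR_REST_TOKEN",
)

results = index.query(
    sparse_vector=SparseVector(indices=[17, 42], values=[0.3, 0.5]),
    top_k=5,
    include_metadata=True,
)
```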
We use the inner product similarity metric to calculate similarity scores, considering only the dimensions where both the query vector and the indexed vector have non-zero values.
Note that the similarity scores are exact, not approximate. So, if fewer than top-K indexed vectors share at least one non-zero valued dimension index with the query vector, the result might contain fewer than top-K vectors.
If you created the sparse index with an Upstash-hosted sparse embedding model, you can query with text data, and Upstash can embed it behind the scenes before performing the actual query.
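For example, with the Python SDK (same assumptions as the sketches above):

```python
from upstash_vector import Index

index = Index(
    url="UPSTASH_VECTOR_REST_URL",
    token="UPSTASH_VECTOR_REST_TOKEN",
)

# The query text is embedded into a sparse query vector behind the scenes.
results = index.query(
    data="Upstash Vector",
    top_k=5,
)
```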
For algorithms like BM25, it is important to take inverse document frequency into account, as it makes matches on rare terms more significant. Maintaining that information yourself can be tricky, so Upstash Vector provides it out of the box. To make use of IDF in your queries, you can pass it as the weighting strategy, as shown below.
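For example, with the Python SDK (a sketch assuming a WeightingStrategy enum in upstash_vector.types, as in recent SDK versions):

```python
from upstash_vector import Index
from upstash_vector.types import WeightingStrategy

index = Index(
    url="UPSTASH_VECTOR_REST_URL",
    token="UPSTASH_VECTOR_REST_TOKEN",
)

# Query terms are weighted by the IDF values maintained in the index.
results = index.query(
    data="Upstash Vector",
    top_k=5,
    weighting_strategy=WeightingStrategy.IDF,
)
```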
Since this is mainly meant to be used with BM25 models, the IDF is defined as:
$$\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

Where:

- N is the total number of documents in the collection.
- n(qᵢ) is the number of documents containing term qᵢ.
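For instance, with hypothetical values N = 1000 and n(qᵢ) = 10, the formula gives:

$$\text{IDF}(q_i) = \ln\left(\frac{1000 - 10 + 0.5}{10 + 0.5} + 1\right) = \ln(95.33\ldots) \approx 4.56$$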