Optimizing scalar quantization for the vector database use case allows us to achieve significantly better performance for the same retrieval quality at high compression ratios.

Introduction

Background

Error correcting the scalar dot product

Optimizing the truncation interval

Proof of principle for int4 quantization

Conclusion

Vector Database Optimized Scalar Quantization

Scalar Quantization Optimized for Vector Databases

This blog explains how int4 quantization works in Lucene, how it compares on speed and size, and the benefits of using int4 quantization.

Introduction to Int4 quantization in Lucene

How does `Int4` quantization work in Lucene

Storing and scoring the quantized vectors

Calculating the quantization error correction

Finding the optimal bucketing for int4 quantization

Speed vs. size for quantization

Speed part 2: more SIMD in int4

The end?

Int4 provides additional compression options. It reduces the quantization space to only 16 possible values (0 through 15).
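To make the 16-value range concrete, here is a minimal sketch in Python. It assumes a fixed truncation interval `[lower, upper]` chosen by the caller, and it packs two 4-bit values into each byte. The function names and the nibble layout are illustrative assumptions, not Lucene's actual storage format.

```python
def quantize_int4(vector, lower, upper):
    """Map each float to one of 16 buckets (0 through 15).

    Values outside the assumed interval [lower, upper] are clamped first.
    """
    step = (upper - lower) / 15  # 16 values means 15 intervals
    return [round((min(max(x, lower), upper) - lower) / step) for x in vector]


def pack_nibbles(values):
    """Pack pairs of 4-bit values into single bytes (hypothetical layout)."""
    padded = list(values)
    if len(padded) % 2:
        padded.append(0)  # pad odd-length input with a zero nibble
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack_nibbles(packed, count):
    """Reverse pack_nibbles, returning the first `count` 4-bit values."""
    values = []
    for byte in packed:
        values.append(byte >> 4)    # high nibble
        values.append(byte & 0x0F)  # low nibble
    return values[:count]


quantized = quantize_int4([-1.0, 1.0, 0.2, -0.5], lower=-1.0, upper=1.0)
packed = pack_nibbles(quantized)
restored = unpack_nibbles(packed, len(quantized))
```

Under this layout, four dimensions occupy two bytes, which is half of what one byte per dimension would need.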

What is Int4 quantization in Lucene?

Understanding Int4 scalar quantization in Lucene

Explore RAG evaluation metrics like BLEU score, ROUGE score, PPL, BARTScore, and more. Discover how Elastic is evaluating RAG with UniEval.

N-gram metrics

BLEU score

ROUGE score

METEOR score

Intrinsic metrics

Perplexity (PPL)

Model-based metrics

BERTScore

BLEURT

BARTScore

UniEval: Elastic’s choice for evaluating RAG

Real-world usage of UniEval

Metrics used to evaluate RAG include n-gram metrics (BLEU, ROUGE, and METEOR scores), intrinsic metrics (such as perplexity, or PPL), model-based metrics (such as BERTScore, BLEURT, and BARTScore), and Elastic's choice, UniEval.

What metrics are commonly used to evaluate RAG?

UniEval evaluates RAG by unifying all evaluation dimensions into a Boolean Question Answering framework, allowing a single model to assess a generated text from various angles.

How does UniEval evaluate RAG?

RAG evaluation metrics: UniEval, BLEU, ROUGE & more

RAG evaluation metrics: A journey through metrics

Explore how Elastic introduced scalar quantization into Lucene, including automatic byte quantization, quantization per segment & performance insights.

Automatic byte quantization in Lucene

Scalar quantization 101

Exploring the architecture

Quantization per segment in Lucene

Quantization that grows with you

Quantization performance & numbers

Scalar quantization is a lossy compression technique. Some simple math gives significant space savings with little impact on recall.

What is scalar quantization?

Understanding scalar quantization in Lucene

Understand what scalar quantization is, how it works and its benefits. This guide also covers the math behind quantization and examples.

Introduction to scalar quantization

Understanding buckets in scalar quantization

The role of statistics in scalar quantization

The role of algebra in scalar quantization

Ensuring accuracy in quantization

Quantization encodes vectors in a lossy manner, trading a slight loss of fidelity for large space savings.

What are the benefits of scalar quantization?

Scalar quantization takes each vector dimension and buckets it into a smaller data type.
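As a rough illustration of the bucketing idea, the sketch below maps floats to 8-bit integer buckets. The interval bounds `lower` and `upper`, the function names, and the simple clamp-and-round scheme are assumptions made for this example, not a description of any particular library's implementation.

```python
def scalar_quantize(vector, lower, upper, bits=8):
    """Bucket each float into an integer in [0, 2**bits - 1].

    `lower` and `upper` are an assumed value range; anything outside it
    is clamped to the nearest bound before bucketing.
    """
    levels = (1 << bits) - 1
    step = (upper - lower) / levels
    quantized = []
    for x in vector:
        clamped = min(max(x, lower), upper)
        quantized.append(round((clamped - lower) / step))
    return quantized, step


def dequantize(quantized, lower, step):
    """Recover approximate floats; the rounding error is what makes this lossy."""
    return [lower + q * step for q in quantized]


quantized, step = scalar_quantize([-1.0, 0.0, 0.37, 1.0], lower=-1.0, upper=1.0)
approx = dequantize(quantized, lower=-1.0, step=step)
```

Each in-range value is recovered to within half a bucket width (`step / 2`), while each dimension shrinks from a 4-byte float to a single byte.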

How does scalar quantization work?

Scalar quantization 101: The basics, benefits and applications

Learn about the improvements we've made to the inference performance of ELSER v2.

Improved Relevance

Quantization

Block Layout of Linear Layers

Improving information retrieval in the Elastic Stack: Improved inference performance with ELSER v2

Learn about how we're reducing retrieval costs for ELSER v2.

Retrieval Cost Aware Training

Optimizing ELSER Queries

Implementing with an Elasticsearch Query

Improving information retrieval in the Elastic Stack: Optimizing retrieval with ELSER v2

Here's how generative AI works from the ground up, including embeddings, transformer-encoder architecture, training/fine-tuning models & more.

The semantic stage of GenAI: Understanding natural language

1. Understanding embeddings, vector similarity and language models

1.1 Learning dense vectors

2. Context: The bloodstream of GenAI

3. The transformer-encoder architecture

3.1 Self-attention

3.2 Multi-headed self-attention

3.3 The feed-forward neural network

4. Training and fine tuning language models for AI search and NLP

4.1 Training patterns and language model transfer learning

4.2 The AI barrier

5. Elastic Learned Sparse Encoder for AI search out of the box

References

BERT is the most prominent encoder architecture. It was introduced in 2018 and revolutionized NLP by surpassing earlier results on most benchmarks for natural language understanding and search. Encoders like BERT are the basis for modern AI: translation, AI search, GenAI and other NLP applications.

What is BERT (Bidirectional Encoder Representations from Transformers)?

Vectors are the fundamental construct with which AI understands language. Vectors represent words and are simply long arrays of numerical values. Technically, vectors may also represent sequences of words or sub-word parts.

What are vectors in generative AI?

Understanding generative AI architectures with transformers

Generative AI architectures with transformers explained from the ground up

In this blog, you'll learn how vector search has been integrated into Elasticsearch and the trade-offs that we made.

Vector search is integrated in Elasticsearch through Apache Lucene

Cons

Merges need to recompute HNSW graphs

Searches need to combine results from multiple segments

RAM needs to scale with the size of the data set to retain optimal performance

Pros

Data sets can scale beyond the total RAM size

Lock-free search

Support for incremental changes

Visibility consistency with other data structures

Incremental snapshots

Filtering and hybrid support

Compatibility with other features

Looking ahead: Separation of indexing and search

The main cons of taking advantage of Apache Lucene for vector search come from the fact that Lucene ties vectors to segments. These cons include: 1) Merges need to recompute HNSW graphs. 2) Searches need to combine results from multiple segments. 3) RAM needs to scale with the size of the data set to retain optimal performance. However, tying vectors to segments is also what enables major features such as efficient pre-filtering, efficient hybrid search, and visibility consistency.

What are the cons of using Apache Lucene for vector search?

The main advantage of using Apache Lucene for vector search is that it enables features such as efficient pre-filtering, efficient hybrid search, and visibility consistency, among others.

What are the pros of using Apache Lucene for vector search?

Vector search is integrated into Elasticsearch through Apache Lucene.

How is vector search integrated into Elasticsearch?

Vector search in Elasticsearch: Integration & the design rationale

Vector search in Elasticsearch: The rationale behind the design

Learn how scalar quantization can be used to reduce the memory footprint of vector embeddings in Elasticsearch through an experiment.

Understanding scalar quantization in Elasticsearch

Experimentation: Evaluating scalar quantization

Overview of methodology

Results

The benefits of using scalar quantization in Elasticsearch include reducing the memory footprint of vector embeddings without significantly affecting retrieval performance.

Evaluating scalar quantization in Elasticsearch