How to build a dataset to evaluate a RAG system?

Table of Contents

System Design
Parameters to optimize
Metrics
Problems
Literature

System Design

Evaluating a RAG system is a systematic process that can be broken down into several steps. The key components of the pipeline are listed below.

Question Generation
Question Filtering and Finalization
RAG
Evaluate the Outcome
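
As a minimal sketch of the first step, question generation, assuming the OpenAI Python SDK (v1+) and a list of already-chunked documents; the model name, prompt wording, and the `chunks` input are illustrative placeholders:

```python
# Sketch: for each chunk, ask an LLM to propose a question the chunk can answer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are building an evaluation dataset for a RAG system.\n"
    "Write one self-contained question that can be answered using only the text below.\n\n"
    "Text:\n{chunk}\n\nQuestion:"
)

def generate_questions(chunks: list[str], model: str = "gpt-4o-mini") -> list[dict]:
    dataset = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
            temperature=0.7,
        )
        question = response.choices[0].message.content.strip()
        # Keep the source chunk as the ground-truth context for later metrics.
        dataset.append({"question": question, "ground_truth_context": chunk})
    return dataset
```

The filtering step can then discard questions that are trivial, ambiguous, or not actually answerable from their source chunk before the dataset is finalized.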

Parameters to optimize

There are many parameters that can be tuned to improve the quality of the final result. I've listed some of them below.

Chunking: chunk size, chunk overlap, embedding model
Retrieval: number of results, similarity threshold, retrieval strategy (BM25, cosine similarity, hybrid), reranking
Generation: LLM model, prompt, temperature
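
One convenient way to keep track of these knobs is a single config object that an experiment runner can sweep over. The sketch below is purely illustrative; the field names and default values are assumptions, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    # Chunking
    chunk_size: int = 512
    chunk_overlap: int = 64
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    # Retrieval
    top_k: int = 5                      # number of results
    similarity_threshold: float = 0.3
    retrieval_strategy: str = "hybrid"  # "bm25", "cosine", or "hybrid"
    rerank: bool = False
    # Generation
    llm_model: str = "gpt-4o-mini"
    prompt_template: str = "Answer using only the context.\n\n{context}\n\nQuestion: {question}"
    temperature: float = 0.0
```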


Metrics

While there are numerous useful metrics, it is impractical to focus on all of them simultaneously. The metrics are grouped below by pipeline stage and can be further categorized by evaluation speed: slow (roughly 5-10 seconds per example, because an additional LLM call is needed to compute them) and fast. This categorization helps to weigh the time cost of each metric during evaluation.

Chunking: difficult to measure in isolation, so it can only be evaluated through end-to-end performance.

Retrieval:
- Context Precision (slow): evaluates whether all of the ground-truth relevant items present in the contexts are ranked at the top. Ideally, all the relevant chunks appear at the top ranks. Computed from the question and the retrieved contexts.
- Mean Average Precision (fast)
- Normalized Discounted Cumulative Gain, nDCG (fast)

Generation:
- Faithfulness (slow): measures the factual consistency of the generated answer against the given context. Computed from the answer and the retrieved context.
- Answer Relevance (slow): assesses how pertinent the generated answer is to the given question. Computed from the question and the answer.
- BLEU/ROUGE/BERTScore (fast)
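
The fast retrieval metrics need no extra tooling. For example, nDCG@k can be computed from binary relevance labels (1 if a retrieved chunk is one of the ground-truth chunks, 0 otherwise) in a few lines; the function and variable names below are illustrative:

```python
import math

def dcg(relevances: list[int]) -> float:
    # Discounted cumulative gain: relevance discounted by log2 of the rank.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Binary relevance for each retrieved chunk, cut off at rank k.
    relevances = [1 if doc_id in relevant_ids else 0 for doc_id in retrieved_ids[:k]]
    # Ideal ranking puts all ground-truth chunks at the top.
    ideal_dcg = dcg([1] * min(k, len(relevant_ids)))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: chunks c1 and c3 are the ground truth; the retriever returned c1, c7, c3.
print(ndcg_at_k(["c1", "c7", "c3"], {"c1", "c3"}, k=3))  # ~0.92
```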


Problems

Quick wins to boost quality

  1. Embeddings matter for quality. Switching to a better embedding model can give a 5-7% improvement right out of the box. You can also fine-tune the encoder end-to-end without touching the LLM, as long as you keep an eye on the end-to-end quality metric. Use MTEB (Massive Text Embedding Benchmark) to compare embedding models. Fine-tuning together with the LLM should be the last resort.
  2. MMR (Maximal Marginal Relevance) on top of the embedding-based retrieval increases the diversity of the retrieved context and improves the final generation. The idea is to return not the top-N closest chunks, but the top-N diverse relevant chunks that are not paraphrases of each other (see the MMR sketch after this list).
  3. Hierarchical chunk splitting, from small to large. Instead of splitting documents only into large pieces, we maintain a hierarchy of large chunks and their sub-chunks. Search is performed over the sub-chunks, while the parent large chunk is what gets passed into the prompt as context. This expansion of context mitigates "lost in the middle" issues (see the parent/child sketch after this list).
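
A minimal MMR sketch, assuming unit-normalised embeddings from whatever embedding model the pipeline already uses; the function name and the lambda trade-off value are illustrative:

```python
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray, top_n: int = 5, lambda_: float = 0.7) -> list[int]:
    # Cosine similarity reduces to a dot product for unit-normalised vectors.
    query_sim = doc_vecs @ query_vec   # relevance of each chunk to the query
    doc_sim = doc_vecs @ doc_vecs.T    # pairwise similarity between chunks
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_n:
        if not selected:
            best = candidates[int(np.argmax(query_sim[candidates]))]
        else:
            # Trade off relevance against redundancy with already-selected chunks.
            scores = [
                lambda_ * query_sim[c] - (1 - lambda_) * max(doc_sim[c, s] for s in selected)
                for c in candidates
            ]
            best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Usage: indices = mmr(query_vec, doc_vecs, top_n=5); context = [chunks[i] for i in indices]
```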

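And a minimal sketch of the parent/child idea from point 3, assuming simple character-based splitting (a real pipeline would typically split on tokens or sentences; all names here are illustrative):

```python
def build_hierarchy(document: str, parent_size: int = 2000, child_size: int = 400):
    # Split into large parent chunks, then split each parent into small children.
    parents = [document[i:i + parent_size] for i in range(0, len(document), parent_size)]
    children: list[str] = []
    child_to_parent: dict[int, int] = {}
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            child_to_parent[len(children)] = p_idx
            children.append(parent[j:j + child_size])
    return parents, children, child_to_parent

def expand_to_parents(child_hits: list[int], parents: list[str], child_to_parent: dict[int, int]) -> list[str]:
    # Replace each retrieved child index with its parent chunk, deduplicated, order preserved.
    seen: set[int] = set()
    context: list[str] = []
    for idx in child_hits:
        p = child_to_parent[idx]
        if p not in seen:
            seen.add(p)
            context.append(parents[p])
    return context

# Embed and index `children`; at query time, retrieve child indices,
# then call expand_to_parents to build the context for the prompt.
```
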
Literature