Evaluating Retrieval Augmented Generation
Why evaluation metrics matter
Deploying generative AI in enterprise workflows is not just about producing answers; it is also about understanding how those answers are generated and how well the system performs. In Retrieval Augmented Generation (RAG) pipelines such as those configured in Mindbreeze InSpire, a query model retrieves relevant information from indexed data sources, and a generative model uses that information to produce a response.
If relevant context is not retrieved, even a strong language model cannot produce a correct answer. Likewise, if the generated response does not align with the retrieved information or the expected answer, the overall quality suffers. This is why systematic evaluation is essential.
Mindbreeze InSpire provides built-in evaluation capabilities that allow administrators to test pipelines against datasets. These evaluations make it possible to assess both retrieval and generation performance, review results in detail, and export enriched datasets for further analysis. Depending on the evaluation mode, datasets can be enriched with retrieved context, generated answers, or both, and can be exported as JSON or CSV.
Available metrics and what they mean for customers
The Mindbreeze evaluation module offers several built‑in metrics for assessing RAG pipelines. These metrics cover both machine‑translation style scores and RAG‑specific indicators:
| Metric (short name) | What it measures (brief) | Why it matters |
|---|---|---|
| BLEU (Bilingual Evaluation Understudy) | Compares machine‑generated text against reference translations to estimate translation quality. | Useful for tasks where generative output should closely match a known reference answer; a higher BLEU score indicates strong overlap with the expected response. |
| ROUGE (Recall‑Oriented Understudy for Gisting Evaluation) | A set of metrics used to evaluate summaries and translations by comparing generated output with human‑written references. | Relevant for summarization and question‑answering tasks; case‑insensitive and suited to measuring overlap with ground‑truth answers. |
| Context Recall | Measures the proportion of relevant documents or passages that were retrieved. | Ensures that the retrieval stage surfaces all important context for the language model; a high context recall means fewer missing facts. |
| Factual Correctness | Assesses how factually accurate the generated response is compared with the expected answer. | Indicates whether the model is hallucinating or inventing facts; low scores signal that answers deviate from truth. |
| Faithfulness | Evaluates how accurately a response aligns with the retrieved context. | A faithful answer is one whose claims are fully supported by the retrieved documents; tracking faithfulness helps minimize hallucinations. |
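To make the overlap-based metrics in the table concrete, here is a simplified, self-contained sketch of a ROUGE-1-style score: case-insensitive unigram overlap between a generated answer and a reference answer. This is an illustration of the idea only, not the implementation Mindbreeze uses; production evaluators typically rely on established libraries with more elaborate n-gram handling.

```python
from collections import Counter

def rouge1_scores(generated: str, reference: str) -> dict:
    """Simplified, case-insensitive ROUGE-1: unigram overlap between
    a generated answer and a human-written reference answer."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # shared words, counted with multiplicity
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A generated answer that contains the full reference but adds extra words scores perfect recall and reduced precision, which is exactly the trade-off these metrics are designed to expose.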
How the metrics work together
Each metric highlights a different aspect of the pipeline. Context Recall focuses on retrieval and indicates whether the system successfully surfaces the relevant documents or information needed to answer a question. BLEU and ROUGE measure similarity between generated and expected answers, which is useful when there is a clearly defined reference.
Factual Correctness goes beyond surface similarity by evaluating whether the generated answer is accurate with respect to the expected response. Faithfulness complements this by checking whether the answer is grounded in the retrieved context.
By combining these perspectives, administrators can better understand whether issues originate from retrieval, generation, or the interaction between both.
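The retrieval-side perspective can be sketched in a few lines. Assuming each question in a dataset is annotated with the documents that are actually relevant to it (an assumption for illustration; the real evaluation works on retrieved contexts), Context Recall reduces to the fraction of those documents the retriever surfaced:

```python
def context_recall(retrieved_ids: set, relevant_ids: set) -> float:
    """Fraction of the relevant documents that the retriever surfaced.
    A value of 1.0 means no needed context was missed."""
    if not relevant_ids:
        return 1.0  # nothing was required, so nothing was missed
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)
```

A recall of 0.5 here would tell an administrator that half of the facts needed for a correct answer never reached the language model, pointing the investigation at retrieval rather than generation.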
Evaluating a RAG pipeline in Mindbreeze
Evaluations are created in the RAG administration interface by selecting a pipeline, a dataset, and an evaluation mode. The available modes are retrieval only, generation only, and retrieval and generation. Metrics can be selected individually, and the evaluation is then executed across all questions in the dataset.
Once completed, the evaluation provides both a general summary and a metric summary. The general summary includes information such as total and completed questions, start and end times, as well as minimum, maximum, and average durations for retrieval and generation. It also shows how many requests were successfully processed.
The metric summary presents aggregated results for each selected metric, including minimum, average, and maximum values, along with the number of successful validations. According to the documentation, metric values range from 1 (best) to 5 (worst), and a value of 5 is considered invalid.
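One plausible way to read this aggregation, sketched below under the assumption that invalid scores (value 5) are excluded from the aggregates, is a small reduction over the per-question scores. This is an illustrative reconstruction, not the Mindbreeze implementation:

```python
def summarize_metric(values: list) -> dict:
    """Aggregate per-question metric scores on the 1 (best) to 5 (worst)
    scale; a score of 5 is treated as an invalid result and excluded."""
    valid = [v for v in values if v < 5]
    if not valid:
        return {"min": None, "avg": None, "max": None, "valid": 0}
    return {
        "min": min(valid),
        "avg": sum(valid) / len(valid),
        "max": max(valid),
        "valid": len(valid),
    }
```

Tracking the count of valid results alongside min/avg/max matters: an average computed over only a handful of valid questions is far less trustworthy than the same average over the full dataset.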
In addition to aggregated results, Mindbreeze provides detailed, question-level insights. For each question, administrators can compare expected answers with generated answers, inspect retrieved contexts, review the queries used during retrieval, and examine how each metric was calculated. This level of transparency supports a deeper understanding of pipeline behavior.
Enriching and exporting datasets
A key feature of the evaluation process is dataset enrichment. When running evaluations, Mindbreeze can augment the original dataset with additional information produced during the evaluation.
If both retrieval and generation are evaluated, the dataset is enriched with retrieved context and generated answers. In retrieval-only mode, it contains retrieved context, and in generation-only mode, it contains generated answers.
These enriched datasets can be saved as new datasets within the system or exported as JSON or CSV files. This makes it possible to reuse evaluation results, perform offline analysis, or iterate on datasets and pipeline configurations.
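The shape of such an export can be sketched with the standard library. The field names below are illustrative assumptions; the actual columns in a Mindbreeze export may differ. JSON preserves the record structure, while CSV flattens each question into one row:

```python
import csv
import io
import json

# Hypothetical enriched records: original question and expected answer,
# augmented with retrieved context and the generated answer.
records = [
    {
        "question": "What is the warranty period?",
        "expected_answer": "Two years.",
        "retrieved_context": "Warranty: 24 months from purchase.",
        "generated_answer": "The warranty period is two years.",
    }
]

# JSON export keeps the nested structure intact.
json_export = json.dumps(records, indent=2)

# CSV export flattens each record into one row per question.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
csv_export = buf.getvalue()
```

Either format can then feed offline analysis tooling or serve as the starting point for a revised dataset.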
Improving RAG quality through evaluation
Evaluation in Mindbreeze is designed to support continuous improvement. By analyzing Context Recall, teams can determine whether retrieval is capturing the necessary information and adjust search configuration, constraints, or data sources accordingly. Factual Correctness and Faithfulness help assess how well generated answers align with expected results and retrieved content, making it easier to identify where generation needs refinement.
At the same time, the general summary provides visibility into performance characteristics such as latency and completion rates. This helps identify operational bottlenecks in addition to quality issues.
Because evaluations can be repeated and exported, they also enable comparisons across different pipeline versions, configurations, or model choices.
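Such a comparison might look like the following sketch, which diffs the aggregated scores of two exported evaluation runs. The metric names and the summary shape are assumptions for illustration; on the 1-to-5 scale described above, lower is better, so negative deltas indicate improvement:

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Per-metric delta between two evaluation summaries.
    Lower is better on the 1 (best) to 5 (worst) scale, so a
    negative delta means the candidate pipeline improved."""
    return {
        metric: round(candidate[metric] - baseline[metric], 3)
        for metric in baseline
        if metric in candidate
    }
```

Running this after each configuration change turns the evaluation module into a lightweight regression check for pipeline quality.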
Conclusion
Mindbreeze InSpire provides a structured framework for evaluating retrieval augmented generation pipelines. With dataset-based evaluations, multiple evaluation modes, and a combination of general and RAG-specific metrics, organizations can systematically assess both retrieval and generation performance.
Metrics such as Context Recall, Factual Correctness, and Faithfulness help determine whether relevant information is retrieved and whether generated answers are accurate and grounded in that information. BLEU and ROUGE add established comparison methods for reference-based evaluation scenarios.
By combining these capabilities with detailed inspection and dataset enrichment, Mindbreeze enables teams to analyze, iterate, and improve their RAG pipelines in a transparent and measurable way.