Jan 11, 2024

How to test RAG systems

This was originally posted to the Gentrace blog and has been reposted here for archiving.

RAG (retrieval-augmented generation) systems can be used to answer questions over data a model was never trained on, such as a customer’s own documents inside your product.

For generative AI developers, RAG can be an easy way to stand up a quick MVP (“Wow, look at how our customers can now chat with their data in our product!”). But, once they get something up and running, developers struggle to systematically make changes to improve quality and reduce hallucination.

While it’s still a new field, best practices for testing RAG systems have already emerged and can be gradually implemented as follows.

Setup work: create some test data

In any of the scenarios below, you’ll need some test data.

Generally, a good test system consists of a set of realistic user queries, the knowledge base content those queries should be answered from, and (eventually) an expected answer for each query.

For example, let’s say we wanted to implement a Q&A feature for customers of our knowledge base (KB) product.

The test suite might look like:
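For instance, a small suite might pair each realistic customer query with the KB article(s) it should be answered from. Here is an illustrative sketch in Python; the queries and article names are invented for the example:

# Illustrative test suite for a knowledge base Q&A product.
# Each case pairs a customer query with the article(s) it should draw on;
# expected answers are added in the next step.
test_suite = [
    {
        "query": "How do I invite a teammate to my workspace?",
        "relevant_articles": ["inviting-teammates"],
    },
    {
        "query": "Can I export my articles to PDF?",
        "relevant_articles": ["exporting-content"],
    },
    {
        "query": "What happens to my data if I cancel my subscription?",
        "relevant_articles": ["billing-and-cancellation", "data-retention"],
    },
]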

Basic testing system: ask GPT-4 if the facts are right

Once you have your test suite, ask a model (e.g., GPT-4) if the facts are right. This technique, known as AI evaluation, requires some work to get right.

First, enrich your test dataset with good expected answers:
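Continuing the illustrative suite from above, each case gains an expert answer (the answers here are invented for the example):

# The same test cases, now enriched with an expert ("expected") answer.
test_suite = [
    {
        "query": "How do I invite a teammate to my workspace?",
        "relevant_articles": ["inviting-teammates"],
        "expected": "Open Settings > Members, click Invite, and enter the teammate's email address.",
    },
    {
        "query": "Can I export my articles to PDF?",
        "relevant_articles": ["exporting-content"],
        "expected": "Yes. Open the article menu and choose Export > PDF.",
    },
]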

Then, programmatically generate outputs using your RAG pipeline and grade them using a prompt similar to the following.

You are comparing a submitted answer to an expert answer on a given question.
Here is the data:
[BEGIN DATA]
************
[Question]: {{input}}
************
[Expert]: {{expected}}
************
[Submission]: {{output}}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer.
Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies.
Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Credit: the fact evaluation from OpenAI Evals.

This prompt converts the input query, expected answer, and actual output into a letter grade from A to E.
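In code, the grading step can look something like the following. This is a minimal sketch assuming the OpenAI Python client; the helper name, the prompt file, and the grade parsing are my own and not part of any standard API.

import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The fact-evaluation prompt above, saved to a file with Python-style
# {input}, {expected}, and {output} placeholders.
FACT_PROMPT = open("fact_eval_prompt.txt").read()

def grade_fact(question: str, expected: str, output: str) -> str:
    """Ask the grading model for a letter grade (A-E) on factual consistency."""
    prompt = FACT_PROMPT.format(input=question, expected=expected, output=output)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    # Naive parsing: look for an "(A)"-style option first, then a bare leading letter.
    match = re.search(r"\(([A-E])\)", text) or re.match(r"([A-E])\b", text)
    return match.group(1) if match else "?"

For each test case, run your RAG pipeline to produce an output, then call grade_fact with the query, the expected answer, and that output.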

Once you have automated grading working, compare and contrast different experiments to systematically get better performance.
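For example (a trivial sketch; the grades shown are made up), tallying the grade distribution per experiment makes regressions and improvements easy to spot:

from collections import Counter

# Letter grades produced by the grader for each test case in two runs.
baseline_grades = ["B", "D", "A", "B", "E"]   # current pipeline
candidate_grades = ["B", "B", "A", "B", "E"]  # after a prompt or retriever change

print("baseline: ", Counter(baseline_grades))
print("candidate:", Counter(candidate_grades))  # fewer D grades means fewer factual disagreements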

Optional enhancements

With the basic pipeline, you’ll probably notice that it’s difficult to differentiate between “retrieval” failures and “generation” failures.

Furthermore, the “Fact” evaluator does not do a good job of understanding the completeness of the response.

Let’s upgrade our testing system with the two enhancements below:

Enhancement #1: Test specific stages of the RAG pipeline

This enhancement answers the question of “where did my system break?”

To understand at a glance where failures are happening, extend your automated output collection to capture the outputs of each step.

Then, set up individual evaluations for each step.
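Concretely, that can mean having the pipeline return its intermediate artifacts alongside the final answer. Here is a sketch; retrieve and generate are stand-ins for your own pipeline steps, not real library functions.

from typing import Callable

def run_rag_pipeline(query: str,
                     retrieve: Callable[[str], list[str]],
                     generate: Callable[[str, list[str]], str]) -> dict:
    """Run the two main RAG steps and keep every intermediate output for evaluation."""
    retrieved_chunks = retrieve(query)          # retrieval step output
    answer = generate(query, retrieved_chunks)  # generation step output
    return {
        "query": query,
        "retrieved_chunks": retrieved_chunks,   # evaluate with retrieval metrics
        "answer": answer,                       # evaluate with the fact prompt above
    }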

For example, for the retrieval step you may want to use Ragas’ context precision and recall evaluations to see how well your retrieval is performing.
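That might look something like this. It is a sketch assuming the ragas and datasets Python packages; column names and metric imports have changed between Ragas versions, so check the current docs before copying it.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# One row per test case: the question, the chunks your retriever returned,
# the generated answer, and the reference answer from your test suite.
data = Dataset.from_dict({
    "question": ["How do I invite a teammate to my workspace?"],
    "contexts": [["Invite teammates from Settings > Members by entering their email."]],
    "answer": ["Open Settings > Members and click Invite."],
    "ground_truth": ["Invite teammates from Settings > Members."],
})

scores = evaluate(data, metrics=[context_precision, context_recall])
print(scores)  # per-metric scores for the retrieval step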

And if you’re dealing with more complex RAG scenarios, the two main steps may have sub-steps. For example, your retrieval step might consist of one sub-step for SQL query generation and one for query execution. Test each sub-step as necessary.

Enhancement #2: break down the Fact evaluation into pieces

The “fact” evaluation above is a good starting point for RAG evaluation.

However, if you need to improve performance, I recommend breaking it down into a pair of evaluations.

Compliance

This evaluation fails when there are new or contradictory facts in a generated output.

You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{ input }}
************
[Expert]: {{ expected }}
************
[Submission]: {{ output }}
************
[END DATA]
Compare the compliance of the facts of the submitted answer with the expert answer.
Ignore any differences in style, grammar, or punctuation. Also, ignore any missing information in the submission; we only care if there is new or contradictory information.
Select one of the following options:
(A) All facts in the submitted answer are consistent with the expert answer.
(B) The submitted answer contains new information not present in the expert answer.
(C) There is a disagreement between the submitted answer and the expert answer.

Completeness

This evaluation fails when the generated output is not as complete as the expected answer.

You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{ input }}
************
[Expert]: {{ expected }}
************
[Submission]: {{ output }}
************
[END DATA]
Compare the completeness of the submitted answer and the expert answer to the question.
Ignore any differences in style, grammar, or punctuation. Also, ignore any extra information in the submission; we only care that the submission completely answers the question.
Select one of the following options:
(A) The submitted answer completely answers the question in a way that is consistent with the expert answer.
(B) The submitted answer is missing information present in the expert answer, but this does not matter for completeness.
(C) The submitted answer is missing information present in the expert answer, which reduces the completeness of the response.
(D) There is a disagreement between the submitted answer and the expert answer.
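Both graders can reuse the same machinery as the fact evaluation. One reasonable way to map their letter grades onto pass/fail (my interpretation of the options above; it again assumes the OpenAI Python client and hypothetical prompt files) is:

import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The compliance and completeness prompts above, saved with
# {input}, {expected}, and {output} placeholders.
COMPLIANCE_PROMPT = open("compliance_prompt.txt").read()
COMPLETENESS_PROMPT = open("completeness_prompt.txt").read()

def letter_grade(template: str, question: str, expected: str, output: str) -> str:
    """Run one grading prompt and pull out the selected option letter."""
    prompt = template.format(input=question, expected=expected, output=output)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    match = re.search(r"\(([A-D])\)", text) or re.match(r"([A-D])\b", text)
    return match.group(1) if match else "?"

def evaluate_case(question: str, expected: str, output: str) -> dict:
    compliance = letter_grade(COMPLIANCE_PROMPT, question, expected, output)
    completeness = letter_grade(COMPLETENESS_PROMPT, question, expected, output)
    return {
        # Compliance fails on new (B) or contradictory (C) information.
        "compliance_pass": compliance == "A",
        # Completeness fails when missing information matters (C) or the answers disagree (D).
        "completeness_pass": completeness in ("A", "B"),
    }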

Credits

I’m Doug Safreno, co-founder and CEO at Gentrace. Gentrace helps technology companies leveraging generative AI automate test and production evaluations like the ones above, making it easier to improve the quality of LLM outputs at scale.

Thank you to our customers and partners for helping me with this post. Also, thank you to the OpenAI Evals contributors and Ragas contributors for providing inspiration for many of the ideas.