Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. Using synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across six different knowledge-intensive tasks in KILT and SuperGLUE, ARES accurately evaluates RAG systems while using a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our datasets and code for replication and deployment available at https://github.com/stanford-futuredata/ARES.
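To illustrate how a small human-annotated set can correct LM-judge predictions, the following is a minimal sketch of a prediction-powered estimate of a mean score (for example, the fraction of query-passage pairs judged context-relevant). It follows the general PPI mean estimator and is not taken from the ARES codebase; the function name `ppi_mean_ci`, its parameters, and the use of binary 0/1 labels are assumptions made for illustration.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(judge_unlabeled, judge_labeled, human_labeled, alpha=0.05):
    """Hypothetical helper: PPI point estimate and confidence interval for a mean.

    judge_unlabeled: judge predictions on the large unlabeled set
    judge_labeled:   judge predictions on the small human-annotated set
    human_labeled:   human labels for that same small set (same order)
    """
    n, big_n = len(human_labeled), len(judge_unlabeled)
    if n != len(judge_labeled):
        raise ValueError("judge_labeled and human_labeled must align")
    if n < 2 or big_n < 2:
        raise ValueError("need at least two points in each set")

    # Rectifier: how far the judge is from the human label on annotated data.
    rectifier = [y - f for y, f in zip(human_labeled, judge_labeled)]

    # Judge mean on unlabeled data, shifted by the average judge error.
    estimate = fmean(judge_unlabeled) + fmean(rectifier)

    # Normal-approximation interval combining both sources of variance.
    std_err = math.sqrt(variance(judge_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    est, (low, high) = ppi_mean_ci(
        judge_unlabeled=[1, 0, 1, 1, 0, 1, 0, 1],
        judge_labeled=[1, 0, 1, 0],
        human_labeled=[1, 0, 0, 0],
    )
    print(f"estimate={est:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The interval narrows as the judge agrees more closely with the human labels, since the rectifier's variance shrinks. This is the sense in which a few hundred annotations can suffice when the judge is reasonably accurate.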