ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. Using synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across six different knowledge-intensive tasks in KILT and SuperGLUE, ARES accurately evaluates RAG systems while using a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our datasets and code for replication and deployment available at https://github.com/stanford-futuredata/ARES.
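The PPI step mentioned above can be illustrated with a minimal sketch of the standard prediction-powered estimator for a mean. This is not taken from the ARES codebase. It assumes binary judge scores (1 = pass, 0 = fail), a large set of judge-only predictions, and a small human-annotated set scored by the same judge. The function name `ppi_mean_interval` and its parameters are hypothetical. The idea is to take the judge's average score on the unlabeled data and correct it by the judge's average error measured on the annotated data, with a normal-approximation confidence interval.

```python
import math
from statistics import NormalDist, mean, variance


def ppi_mean_interval(judge_unlabeled, judge_labeled, human_labeled, alpha=0.05):
    """Prediction-powered estimate of a mean score with a (1 - alpha) interval.

    judge_unlabeled: judge scores on examples without human labels
    judge_labeled:   judge scores on the human-annotated examples
    human_labeled:   human scores on those same annotated examples
    (All names are illustrative assumptions, not ARES identifiers.)
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("judge_labeled and human_labeled must align one-to-one")
    n_labeled = len(human_labeled)
    n_unlabeled = len(judge_unlabeled)
    if n_labeled < 2 or n_unlabeled < 2:
        raise ValueError("need at least two examples in each set")

    # Rectifier: how far the judge is from the human label on each annotated example.
    rectifier = [y - f for y, f in zip(human_labeled, judge_labeled)]

    # Point estimate: judge average on unlabeled data plus the average correction.
    point = mean(judge_unlabeled) + mean(rectifier)

    # The two sample means are independent, so their variances add.
    std_err = math.sqrt(
        variance(judge_unlabeled) / n_unlabeled + variance(rectifier) / n_labeled
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return point, (point - z * std_err, point + z * std_err)


if __name__ == "__main__":
    # Toy data only; real evaluations would use hundreds of annotations.
    estimate, (low, high) = ppi_mean_interval(
        judge_unlabeled=[1, 1, 1, 0],
        judge_labeled=[1, 1, 0, 0],
        human_labeled=[1, 0, 0, 0],
    )
    print(f"estimate={estimate:.3f}, interval=({low:.3f}, {high:.3f})")
```

With more human annotations the rectifier term is estimated more precisely, so the interval narrows. A more accurate judge also shrinks the rectifier variance, which is why a modest annotation budget can suffice.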

