Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers

Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers is rarely compared directly to the quality of retrieved answers. As models evolve and prompts are modified, we have no systematic way to measure improvements without resorting to expensive human judgments. To address this problem, we adapt standard retrieval benchmarks to evaluate answers generated by large language models. Inspired by the BERTScore metric for summarization, we explore two approaches. In the first, we base our evaluation on the benchmark relevance judgments, and we run experiments to determine how information retrieval relevance judgments can serve as an anchor for evaluating generated answers. In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, which allows us to measure improvements without human judgments. In both cases, we measure the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage from the retrieval benchmark.
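The similarity measurement described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for whatever neural encoder is actually used, and the function names, the choice of cosine similarity, and the choice to take the maximum over reference passages are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a neural encoder (assumption): a term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_answer(generated_answer, reference_passages):
    """Score an answer by its best similarity to any reference passage.

    Taking the maximum is an illustrative choice (assumption); averaging
    over the passages would be an equally plausible aggregation.
    """
    answer_vec = embed(generated_answer)
    return max(
        (cosine(answer_vec, embed(p)) for p in reference_passages),
        default=0.0,  # no reference passages available for this question
    )
```

Under the first approach, `reference_passages` would hold the passages judged relevant in the benchmark; under the second, it would hold the top results returned by a retrieval model, which are only assumed to be relevant.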

