Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers is rarely compared directly to the quality of retrieved answers. As models evolve and prompts are modified, we have no systematic way to measure improvements without resorting to expensive human judgments. To address this problem, we adapt standard retrieval benchmarks to evaluate answers generated by large language models. Inspired by the BERTScore metric for summarization, we explore two approaches. In the first, we base our evaluation on the benchmark relevance judgments, and we examine empirically how information retrieval relevance judgments can serve as an anchor for evaluating generated answers. In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, allowing us to measure improvements without human judgments. In both cases, we measure the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage from the retrieval benchmark.
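The similarity measurement shared by both approaches can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, cosine similarity is assumed as the similarity function, and taking the maximum over reference passages is one possible aggregation, not necessarily the one used in this work.

```python
import hashlib
import math
from typing import List

DIM = 256  # assumed vector size for the toy embedder below


def toy_embed(text: str, dim: int = DIM) -> List[float]:
    """Hypothetical stand-in for a real embedding model.

    Maps each whitespace token to a bucket via a stable hash and counts
    occurrences. A real evaluation would use a learned text encoder.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return vec


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_answer(generated: str, reference_passages: List[str]) -> float:
    """Score a generated answer against reference passages.

    The references may be passages judged relevant in the benchmark
    (first approach) or top results from a retrieval model (second
    approach). Aggregating with max is an assumption of this sketch.
    """
    if not reference_passages:
        raise ValueError("at least one reference passage is required")
    answer_vec = toy_embed(generated)
    return max(cosine(answer_vec, toy_embed(p)) for p in reference_passages)
```

Under these assumptions, calling `score_answer(answer, passages)` for each query and averaging the results would give one number per model or prompt, which could then be compared across systems without new human judgments.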