A recent focus of large language model (LLM) development, as exemplified by generative search engines, is to incorporate external references to generate and support their claims. However, evaluating attribution, i.e., verifying whether a generated statement is indeed fully supported by the cited reference, remains an open problem. Although human evaluation is common practice, it is costly and time-consuming. In this paper, we investigate the automatic evaluation of attribution by LLMs. We begin by providing a definition of attribution and then explore two approaches for automatic evaluation: prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is repurposed from related tasks, such as question answering, fact-checking, natural language inference, and summarization. To facilitate the evaluation, we manually curate a set of test examples covering 12 domains from a generative search engine, New Bing. Our results on the curated test set and on simulated test examples from existing benchmark questions highlight both promising signals and remaining challenges for the automatic evaluation of attribution. We hope our testbed, modeling methodology, and insights will help lay the foundation for future studies on this important problem.