Large language models (LLMs) have emerged as a widely used tool for information seeking, but their generated outputs are prone to hallucination. In this work, we aim to enable LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce results and to compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems that retrieve supporting evidence and generate answers with citations. We build automatic metrics along three dimensions -- fluency, correctness, and citation quality -- and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement -- for example, on the ELI5 dataset, even the best model has 49% of its generations lacking complete citation support. Our extensive analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.