Large language models (LLMs) have emerged as a widely used tool for information seeking, but their generated outputs are prone to hallucination. In this work, we aim to enable LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems that retrieve supporting evidence and generate answers with citations. We develop automatic metrics along three dimensions (fluency, correctness, and citation quality) and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement. For example, on the ELI5 dataset, even the best models lack complete citation support 50% of the time. Our analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.