Enhancing attribution in large language models (LLMs) is a crucial task. One feasible approach is to enable LLMs to cite the external sources that support their generations. However, existing datasets and evaluation methods in this domain still exhibit notable limitations. In this work, we formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations. WebCiteS derives from real-world user queries and web search results, offering a valuable resource for model training and evaluation. Prior work in attribution evaluation does not differentiate between groundedness errors and citation errors, and it falls short in automatically verifying sentences that draw partial support from multiple sources. We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose sentences into sub-claims for fine-grained verification. Our comprehensive evaluation of both open-source and proprietary models on WebCiteS highlights the challenge LLMs face in correctly citing sources, underscoring the necessity for further improvement. The dataset and code will be open-sourced to facilitate further research in this crucial field.
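For intuition, the sketch below illustrates the kind of fine-grained verification the abstract describes: decomposing a sentence into sub-claims and checking each one separately for groundedness (is it supported by any source?) and citation correctness (is it supported by the sources it cites?). This is not the authors' released evaluator; `split_into_subclaims` and `entails` are hypothetical stand-ins that a real system would back with an LLM prompt and an NLI model, respectively.

```python
# Minimal sketch of sub-claim-level attribution checking (assumptions noted below).
from typing import Dict, List


def split_into_subclaims(sentence: str) -> List[str]:
    # Stand-in: naive split on "and"/";". A real evaluator would prompt an
    # LLM to produce atomic, self-contained sub-claims.
    return [c.strip() for c in sentence.replace("; ", " and ").split(" and ") if c.strip()]


def entails(evidence: str, claim: str) -> bool:
    # Stand-in for an NLI model: crude lexical-overlap heuristic.
    claim_tokens = set(claim.lower().split())
    return len(claim_tokens & set(evidence.lower().split())) >= 0.6 * len(claim_tokens)


def verify_sentence(sentence: str, cited: List[str], all_sources: List[str]) -> List[Dict]:
    """Check each sub-claim twice: against all sources (groundedness) and
    against the cited sources only (citation correctness). Concatenating
    sources lets a claim draw partial support from multiple documents."""
    report = []
    for claim in split_into_subclaims(sentence):
        grounded = any(entails(s, claim) for s in all_sources) or entails(" ".join(all_sources), claim)
        cited_ok = any(entails(s, claim) for s in cited) or entails(" ".join(cited), claim)
        report.append({
            "claim": claim,
            "groundedness_error": not grounded,           # unsupported by any source
            "citation_error": grounded and not cited_ok,  # supported, but not by its citations
        })
    return report
```

Separating the two checks is what allows the metrics to distinguish a groundedness error (the model hallucinated content absent from every source) from a citation error (the content is supported somewhere, but the attached citations do not cover it).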