The advancement of LLMs has significantly boosted performance on complex long-form question answering tasks. However, one prominent issue with LLMs is that they generate "hallucinated" responses that are not factual. Consequently, attributing each claim in a response has become a common solution for improving factuality and verifiability. Existing research mainly focuses on how to provide accurate citations for a response, largely overlooking the importance of identifying the claims or statements within each response. To bridge this gap, we introduce a new claim decomposition benchmark, which requires building systems that can identify atomic and checkworthy claims in LLM responses. Specifically, we present the Chinese Atomic Claim Decomposition Dataset (CACDD), which builds on the WebCPM dataset with additional expert annotations to ensure high data quality. The CACDD comprises 500 human-annotated question-answer pairs, containing a total of 4956 atomic claims. We further propose a new pipeline for human annotation and describe the challenges of this task. In addition, we provide experimental results on zero-shot, few-shot, and fine-tuned LLMs as baselines. The results show that claim decomposition is highly challenging and requires further exploration. All code and data are publicly available at \url{https://github.com/FBzzh/CACDD}.