The increasing use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies of LLM outputs. We design and build an annotation tool to speed up the labelling procedure and ease the workload of raters. It allows flexible incorporation of automatic results at any stage, e.g. automatically retrieved evidence. We further construct an open-domain document-level factuality benchmark at three levels of granularity: claim, sentence and document. Preliminary experiments show that FacTool, FactScore and Perplexity.ai struggle to identify false claims, with the best F1 being 0.53. The annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.