The increasing use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, built around a multi-stage annotation scheme designed to yield detailed labels for the verifiability and factual inconsistencies of LLM outputs. We design and build an annotation tool to speed up labelling and ease the workload of raters. The tool allows automatic results, e.g. automatically retrieved evidence, to be incorporated flexibly at any stage. We further construct an open-domain, document-level factuality benchmark at three levels of granularity: claim, sentence and document. Preliminary experiments show that FacTool, FactScore and Perplexity.ai struggle to identify false claims, with the best F1 score being 0.53. The annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.
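To make the three granularity levels and the reported F1 metric concrete, the following is a minimal sketch and not the benchmark's actual schema or evaluation code. The class names `Sentence` and `Document`, the function `f1_on_false`, and the aggregation rule (a sentence or document counts as factual only if every claim it contains is factual) are all assumptions introduced for illustration; the abstract does not specify how labels propagate between levels.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical container: one sentence and the labels of its atomic claims."""
    text: str
    # One boolean per claim: True = factually correct, False = false claim.
    claim_labels: list[bool] = field(default_factory=list)

    def is_factual(self) -> bool:
        # Assumption: a sentence is factual only if all of its claims are.
        return all(self.claim_labels)


@dataclass
class Document:
    """Hypothetical container: an LLM response split into sentences."""
    sentences: list[Sentence]

    def is_factual(self) -> bool:
        # Assumption: a document is factual only if all of its sentences are.
        return all(s.is_factual() for s in self.sentences)


def f1_on_false(gold: list[bool], pred: list[bool]) -> float:
    """F1 with 'false claim' (label False) treated as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if g and not p)
    fn = sum(1 for g, p in zip(gold, pred) if not g and p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a single false claim is enough to mark its sentence and its document as non-factual, and `f1_on_false` scores a checker on how well it recovers the false claims rather than the true ones.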