Recognizing whether LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking verify each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many model calls to check a single response. In this work, we show how to build small fact-checking models that achieve GPT-4-level performance at 400x lower cost. We do this by constructing synthetic training data with GPT-4, creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in a claim and to recognize synthesis of information across sentences. For evaluation, we unify datasets from recent work on fact-checking and grounding LLM generations into a new benchmark, LLM-AggreFact. Our best system, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size and matches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.