Recognizing whether LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" verify each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many LLM calls to check a single response. In this work, we show how to build small models with GPT-4-level performance at 400x lower cost. We do this by constructing synthetic training data with GPT-4, creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark, LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system, MiniCheck-FT5 (770M parameters), outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.
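The checking procedure described above — verifying each sentence of a response against retrieved evidence — can be sketched as follows. This is a minimal illustration, not the released MiniCheck API: the `entails` function below is a toy lexical-overlap stand-in for a trained checker such as MiniCheck-FT5 or an LLM judge.

```python
# Sketch of sentence-level fact-checking: each sentence of a model
# response is checked against every evidence chunk; a sentence counts
# as grounded if at least one chunk supports it.

def entails(evidence: str, claim: str) -> bool:
    # Toy lexical-overlap scorer standing in for a real entailment
    # model (hypothetical; a trained checker would replace this).
    ev = {w.strip(".,").lower() for w in evidence.split()}
    cl = {w.strip(".,").lower() for w in claim.split()}
    return len(ev & cl) / max(len(cl), 1) > 0.6

def check_response(response_sentences, evidence_chunks):
    """Label each sentence as grounded (True) or unsupported (False)."""
    return [
        any(entails(chunk, sent) for chunk in evidence_chunks)
        for sent in response_sentences
    ]

evidence = ["The Eiffel Tower is in Paris and was completed in 1889."]
sentences = [
    "The Eiffel Tower is in Paris.",
    "It was completed in 1925.",
]
print(check_response(sentences, evidence))  # → [True, False]
```

With an LLM as the checker, the inner loop above is what makes verification expensive: the number of calls grows with both the sentence count and the evidence count, which is exactly the cost a single small checker model amortizes.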