LLMs can generate factually incorrect statements even when given access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit, a tool intended to assist in fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in the outputs of 8 different LLMs when they summarize documents from diverse domains. To ensure that most errors are flagged by the system, we propose a method that can increase error recall while minimizing the impact on precision. We release our tool (GenAudit) and fact-checking model for public use.
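To make the recall and precision trade-off mentioned above concrete, the following is a minimal sketch, not the method proposed in this work. It assumes a hypothetical setup in which each claim receives an error score and is flagged once that score reaches a threshold; the function name, the scores, and the labels are all invented for illustration.

```python
from typing import List, Tuple


def precision_recall(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall of error flagging at a given threshold.

    scores: hypothetical per-claim error scores (higher = more likely an error).
    labels: True where the claim really is an error (toy ground truth).
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, l in zip(flagged, labels) if f and l)
    fp = sum(1 for f, l in zip(flagged, labels) if f and not l)
    fn = sum(1 for f, l in zip(flagged, labels) if not f and l)
    # Fall back to 1.0 when a denominator is empty (nothing flagged / no errors).
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

for threshold in (0.5, 0.35):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.35 flags one more true error without adding a false alarm, so recall rises. In general a lower threshold tends to cost precision, which is why a method that limits that cost is useful.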