SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and unstable trade-offs between completeness and conciseness, particularly in safety- and compliance-critical settings.

Objectives: To evaluate extraction as a hallucination-resistant alternative to rewriting-based RAG, and to compare strategies that balance precision, recall and safety across document types and model scales.

Methods: We compare multiple prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers using supporting evidence from source guidelines. Experiments are conducted on documents of varying length and structure, including local NHS acute care and oncology guidelines and UK-wide NICE guidelines, using both frontier-scale and locally deployable models. Performance is assessed using automatic metrics and human expert evaluation of relevance and completeness.

Results: Line-number selection achieves the strongest results, outperforming direct copying and safety-focused strategies across both large and small models while maintaining high term recall (up to 95%) and close alignment with source text. Safety-oriented approaches improve precision but introduce systematic omissions, and multi-stage filtering further amplifies this trade-off. Performance varies with document structure: line-based extraction excels in protocol-like content, whereas alternative strategies perform better on more verbose documents (up to 97% term recall).
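To make the idea of line-number-based source selection concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the model is shown a numbered copy of the guideline and replies only with line numbers or ranges, so that the final answer is assembled from verbatim source lines. The function names (`number_lines`, `parse_selection`, `extract`, `term_recall`), the reply format and the definition of term recall as the fraction of reference terms found in the answer are all illustrative assumptions; the abstract does not specify the prompts or the exact metric.

```python
import re


def number_lines(document: str) -> str:
    """Prefix each line with a 1-based index, ready to paste into a prompt."""
    lines = document.splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))


def parse_selection(reply: str, n_lines: int) -> list[int]:
    """Parse a reply such as '2, 4-5' into sorted, in-range line numbers.

    Assumption: the reply contains only numbers and ranges. Out-of-range
    numbers are dropped, so the model cannot point at text that is not there.
    """
    selected: set[int] = set()
    for start, end in re.findall(r"(\d+)(?:\s*-\s*(\d+))?", reply):
        lo = max(int(start), 1)
        hi = min(int(end) if end else int(start), n_lines)
        selected.update(range(lo, hi + 1))
    return sorted(selected)


def extract(document: str, reply: str) -> list[str]:
    """Return the selected source lines verbatim, in document order."""
    lines = document.splitlines()
    return [lines[i - 1] for i in parse_selection(reply, len(lines))]


def term_recall(reference_terms: list[str], answer: str) -> float:
    """Fraction of reference terms found in the answer (case-insensitive).

    Hypothetical metric definition, used here only for illustration.
    """
    if not reference_terms:
        return 0.0
    text = answer.lower()
    hits = sum(1 for term in reference_terms if term.lower() in text)
    return hits / len(reference_terms)


# Toy guideline and a hypothetical model reply; neither comes from the study.
guideline = (
    "Check airway.\n"
    "Give oxygen if SpO2 < 94%.\n"
    "Record observations.\n"
    "Escalate to senior clinician."
)
model_reply = "2, 4"
answer_lines = extract(guideline, model_reply)
recall = term_recall(["oxygen", "escalate", "airway"], " ".join(answer_lines))
```

Because the answer is built only from lines that exist in the source, any hallucinated line number is discarded rather than turned into invented text. Under this design the remaining risk is omission, which is what a recall-style metric is meant to measure.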
