Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response's fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
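The data-construction step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the exact threshold semantics, and the two modes (dropping incorrect fragments vs. truncating with the "Unsure from Here" marker) are assumptions made for clarity.

```python
# Hypothetical sketch of HALT-style finetuning-target construction.
# Names and threshold semantics are illustrative assumptions.

UNSURE = "Unsure from Here"

def build_target(fragments, correct, tau, mode="truncate"):
    """Build a capability-aligned finetuning target.

    fragments: atomic statements / reasoning steps from the pretrained LLM.
    correct:   parallel list of bools from ground-truth checking.
    tau:       threshold in [0, 1]; keep a prefix whose running mean
               correctness stays >= tau (higher tau -> shorter but
               more correct targets, i.e. less completeness).
    mode:      'truncate' stops at the marker; 'filter' drops
               incorrect fragments instead.
    """
    if mode == "filter":
        return " ".join(f for f, ok in zip(fragments, correct) if ok)
    kept, n_ok = [], 0
    for frag, ok in zip(fragments, correct):
        n_ok += ok
        # Would including this fragment drop mean correctness below tau?
        if n_ok / (len(kept) + 1) < tau:
            kept.append(UNSURE)  # partially abstain from here on
            break
        kept.append(frag)
    return " ".join(kept)
```

For example, with fragments `["a", "b", "c"]` labeled `[True, False, True]`, a strict threshold (`tau=1.0`) yields `"a Unsure from Here"`, while `tau=0.0` keeps the full response, illustrating the completeness-vs-correctness trade-off.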