Large language models (LLMs) can generate long, coherent text, but they still frequently hallucinate facts, which limits their reliability. To address this issue, inference-time methods have been proposed that elicit truthful responses by shifting LLM representations towards learned "truthful directions". However, applying the truthful directions with the same intensity to every input fails to generalize across question contexts. We propose LITO, a Learnable Intervention method for Truthfulness Optimization that automatically identifies the optimal intervention intensity for a specific context. LITO explores a sequence of model generations produced at increasing intervention intensities. It then selects the most accurate response, or refuses to answer when the predictions are highly uncertain. Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy. The adaptive nature of LITO counters the shortcomings of one-size-fits-all intervention-based solutions, maximizing model truthfulness by reflecting internal knowledge only when the model is confident.
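The select-or-refuse step described above can be illustrated with a minimal sketch. It assumes that each candidate generation already carries a confidence score estimating how likely it is to be truthful. How LITO learns that estimate is not specified in this abstract, so the `Candidate` type, the `confidence` field, the `threshold` value, and the `REFUSAL` string are all hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """One generation produced at a given intervention intensity (hypothetical type)."""
    intensity: float   # strength of the shift along the truthful direction
    response: str      # text generated at that intensity
    confidence: float  # assumed estimate in [0, 1] that the response is truthful


# Hypothetical refusal text; the abstract does not state the wording used.
REFUSAL = "I have no comment."


def select_response(candidates: Sequence[Candidate], threshold: float = 0.5) -> str:
    """Return the most confident response, or refuse if none is confident enough.

    `threshold` is an assumed cut-off, not a value taken from the paper.
    """
    if not candidates:
        return REFUSAL
    # Pick the candidate with the highest estimated truthfulness.
    best = max(candidates, key=lambda c: c.confidence)
    # Refuse when even the best candidate falls below the cut-off.
    return best.response if best.confidence >= threshold else REFUSAL
```

In this sketch the intensity is only carried along as metadata; the point is that the choice is made per question, so different questions can end up with responses generated at different intensities, and a question with no sufficiently confident candidate yields a refusal.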