We introduce Inference-Time Intervention (ITI), a technique designed to enhance the truthfulness of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On Alpaca, an instruction-finetuned LLaMA, ITI improves truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
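The activation shift described in the abstract can be illustrated with a minimal sketch. It assumes that per-head directions and the set of heads to intervene on are already available; the abstract does not specify how either is obtained, so both are treated here as given inputs. The function name `shift_heads`, its parameters, and the normalization of each direction to unit length are illustrative assumptions, not the authors' implementation.

```python
import math


def shift_heads(head_acts, directions, selected, alpha=1.0):
    """Shift the activations of selected attention heads along given directions.

    head_acts:  list of per-head activation vectors (one list of floats per head)
    directions: list of per-head direction vectors, same shape (assumed given)
    selected:   indices of the heads to intervene on (assumed given)
    alpha:      intervention strength; 0 leaves the activations unchanged
    """
    # Copy so the caller's activations are not modified in place.
    shifted = [list(row) for row in head_acts]
    for h in selected:
        norm = math.sqrt(sum(x * x for x in directions[h]))
        if norm == 0.0:
            continue  # a zero vector defines no direction; skip this head
        for i, d in enumerate(directions[h]):
            # Move the activation by alpha along the unit-length direction.
            shifted[h][i] += alpha * d / norm
    return shifted


acts = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
dirs = [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]]
result = shift_heads(acts, dirs, selected=[1], alpha=5.0)
```

In this sketch only head 1 is modified, and `alpha` plays the role of the intervention strength that the abstract says can be tuned to balance truthfulness against helpfulness.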