We introduce Inference-Time Intervention (ITI), a technique designed to enhance the truthfulness of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
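To make the core operation concrete, the following is a minimal sketch of shifting the activations of a few selected attention heads along fixed directions, scaled by an intervention strength. It is an illustration under stated assumptions, not the paper's implementation: the function name `shift_head_activations`, the dictionary of per-head directions, and the choice to normalize each direction to unit length before scaling by `alpha` are all hypothetical. How the directions and heads are chosen is outside the scope of this sketch.

```python
from typing import Dict, List

Vector = List[float]


def shift_head_activations(
    head_acts: List[Vector],
    directions: Dict[int, Vector],
    alpha: float,
) -> List[Vector]:
    """Return a copy of head_acts with selected heads shifted along a direction.

    head_acts:  one activation vector per attention head (hypothetical layout).
    directions: maps a head index to its direction; heads not listed are untouched.
    alpha:      intervention strength (assumed here to scale a unit-length direction).
    """
    shifted = [list(a) for a in head_acts]  # copy so the input is not mutated
    for head, direction in directions.items():
        norm = sum(d * d for d in direction) ** 0.5
        if norm == 0.0:
            continue  # a zero direction carries no information; skip it
        for i, d in enumerate(direction):
            shifted[head][i] += alpha * d / norm
    return shifted


# Toy example: 4 heads of dimension 2, intervening on heads 1 and 3 only.
acts = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
dirs = {1: [3.0, 4.0], 3: [0.0, -2.0]}
out = shift_head_activations(acts, dirs, alpha=5.0)
```

In this sketch, setting `alpha` to 0 leaves every activation unchanged, and larger values push the selected heads further along their directions, which is the knob the abstract describes for balancing truthfulness against helpfulness.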