Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content when these are present in the context. We study harmful imitation through the lens of a model's internal representations and identify two related phenomena: "overthinking" and "false induction heads". The first phenomenon, overthinking, appears when we decode predictions from intermediate layers and compare the results given correct versus incorrect few-shot demonstrations. At early layers, both kinds of demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, is a possible mechanistic cause of overthinking: these are attention heads in late layers that attend to and copy false information from previous demonstrations, and ablating them reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
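To make the idea of decoding predictions from intermediate layers concrete, the following is a minimal sketch, not the procedure used in this work. It assumes that the hidden state at the final token position is available for every layer and that it can be projected to vocabulary logits with a single unembedding matrix; a real model would typically also apply its final normalization layer first, which is omitted here. The function names, the array shapes, and the gap threshold used to locate a "critical layer" are all hypothetical choices made for illustration.

```python
import numpy as np


def layerwise_accuracy(hidden_states, unembed, labels):
    """Accuracy of predictions decoded from each layer.

    hidden_states: (n_examples, n_layers, d_model) final-token states (assumed given)
    unembed:       (d_model, vocab_size) projection to logits (assumed shared by all layers)
    labels:        (n_examples,) index of the correct token for each example
    """
    logits = hidden_states @ unembed          # (n_examples, n_layers, vocab_size)
    preds = logits.argmax(axis=-1)            # (n_examples, n_layers)
    return (preds == labels[:, None]).mean(axis=0)


def critical_layer(acc_correct, acc_incorrect, threshold=0.1):
    """First layer where the accuracy gap exceeds `threshold`, or None.

    The threshold rule is an illustrative assumption, not a definition
    taken from the text.
    """
    gap = np.asarray(acc_correct) - np.asarray(acc_incorrect)
    above = np.nonzero(gap > threshold)[0]
    return int(above[0]) if above.size else None


if __name__ == "__main__":
    # Random stand-in data; real hidden states would come from a model
    # run once with correct and once with incorrect demonstrations.
    rng = np.random.default_rng(0)
    states = rng.normal(size=(8, 4, 5))
    unembed = rng.normal(size=(5, 3))
    labels = rng.integers(0, 3, size=8)
    print(layerwise_accuracy(states, unembed, labels))
```

Under these assumptions, one would compute `layerwise_accuracy` separately for prompts with correct and with incorrect demonstrations and pass the two curves to `critical_layer` to see where they begin to diverge.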