Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content when such content is present in the context. We study harmful imitation through the lens of a model's internal representations and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both kinds of demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, is a possible mechanistic cause of overthinking: these are attention heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
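To make the idea of decoding predictions from intermediate layers concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each layer's hidden state can be projected onto label logits with a shared unembedding matrix, and it defines the "critical layer" as the first layer where the accuracy gap between correct and incorrect demonstrations exceeds a threshold. The function names (`decode_layer`, `layerwise_accuracy`, `critical_layer`), the threshold value, and the gap-based definition are all hypothetical choices made for illustration.

```python
# Toy sketch: decode a prediction from every layer and locate where the
# accuracy under correct vs. incorrect demonstrations starts to diverge.
# All names, shapes, and the threshold are illustrative assumptions.

def decode_layer(hidden, unembed):
    """Project one hidden state onto label logits; return the argmax label index."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return max(range(len(logits)), key=logits.__getitem__)


def layerwise_accuracy(hidden_by_layer, unembed, labels):
    """Accuracy per layer, where hidden_by_layer[l][i] is example i's state at layer l."""
    accuracies = []
    for layer_states in hidden_by_layer:
        preds = [decode_layer(h, unembed) for h in layer_states]
        correct = sum(p == y for p, y in zip(preds, labels))
        accuracies.append(correct / len(labels))
    return accuracies


def critical_layer(acc_correct, acc_incorrect, threshold=0.1):
    """First layer index whose accuracy gap exceeds `threshold` (assumed definition)."""
    for layer, (a, b) in enumerate(zip(acc_correct, acc_incorrect)):
        if a - b > threshold:
            return layer
    return None  # the two conditions never diverge by more than the threshold
```

In practice the per-layer hidden states would come from a real model run twice on the same queries, once with correct and once with incorrect demonstrations; here they are plain nested lists so the sketch stays dependency-free.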