Pretrained Transformers can perform in-context learning (ICL) from a few demonstrations, but this ability can fail sharply when the test distribution differs from pretraining, a common deployment setting. We study attention temperature as a simple inference-time control for improving ICL robustness under such shifts. In a high-dimensional linear-regression framework, we analyze a Transformer with "approximate softmax" attention, which preserves softmax's normalization and temperature-dependent selectivity while remaining tractable. We derive a closed-form expression for the ICL generalization error under distribution shift, and show that it is minimized by an explicit optimal attention temperature. This characterization yields interpretable guidance by linking the best temperature to moments of the pre-softmax attention scores, and predicts when temperature adjustment can recover near Bayes-optimal performance. We validate the theory with extensive simulations, and further demonstrate gains on pretrained LLMs (GPT-2 and Llama2-7B) on question-answering benchmarks under distribution shift induced by noisy in-context demonstrations. Overall, attention temperature emerges as a principled, lightweight knob for improving the robustness of ICL in pretrained Transformers.
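As a concrete picture of the knob discussed above, the following minimal sketch applies a temperature to standard softmax attention for a single query. It is an illustration under stated assumptions, not the paper's model: the "approximate softmax" attention analyzed in the paper is not specified here, so ordinary softmax is used in its place, the convention of dividing the pre-softmax scores by the temperature is assumed, and the function name `softmax_attention` and its parameters are hypothetical.

```python
import numpy as np


def softmax_attention(query, keys, values, temperature=1.0):
    """Single-query softmax attention with an explicit temperature.

    Assumption: the temperature divides the pre-softmax scores, so a small
    temperature sharpens the weights and a large one flattens them.
    """
    scores = keys @ query / temperature   # pre-softmax scores, shape (n,)
    scores = scores - scores.max()        # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()     # normalize so the weights sum to 1
    return weights @ values, weights


# Toy data: n in-context examples with d-dimensional keys and scalar values.
rng = np.random.default_rng(0)
n, d = 8, 4
keys = rng.normal(size=(n, d))
values = rng.normal(size=n)
query = rng.normal(size=d)

# Sweep the temperature at inference time; the inputs stay fixed.
for tau in (0.25, 1.0, 4.0):
    output, weights = softmax_attention(query, keys, values, temperature=tau)
    print(f"tau={tau}: largest weight={weights.max():.3f}, output={output:.3f}")
```

Lowering the temperature concentrates the weights on the highest-scoring demonstrations, while raising it moves the output toward a plain average of the values. Choosing where to sit on that spectrum under distribution shift is the question the abstract addresses.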