We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the \emph{tuned lens}, is a refinement of the earlier ``logit lens'' technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.
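To make the idea of a per-block affine probe concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the logit lens decodes a hidden state by applying the model's unembedding matrix directly, and that the tuned lens first passes the hidden state through a learned affine map for that layer. The names `logit_lens`, `tuned_lens`, `A`, `b`, and `unembed` are hypothetical. The toy dimensions, the omission of the final layer normalization, and the use of KL divergence to compare a probe's output against a reference distribution are simplifying assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def logit_lens(hidden, unembed):
    """Assumed logit lens: decode a hidden state with the unembedding alone."""
    return softmax(matvec(unembed, hidden))


def tuned_lens(hidden, A, b, unembed):
    """Assumed tuned lens: apply a layer-specific affine map, then decode."""
    translated = [y + bias for y, bias in zip(matvec(A, hidden), b)]
    return softmax(matvec(unembed, translated))


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy setup (hypothetical values): hidden size 2, vocabulary size 3.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.25]

# With an identity map and zero bias, the probe reduces to the logit lens.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

p_logit = logit_lens(hidden, unembed)
p_tuned = tuned_lens(hidden, identity, zero_bias, unembed)
gap = kl_divergence(p_logit, p_tuned)
```

In this sketch the affine parameters are fixed by hand. In the setting the abstract describes, one such map would be trained per block while the pretrained model stays frozen.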