How do sequence models represent their decision-making process? Prior work suggests that Othello-playing neural networks learn nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for "my colour" vs. "opponent's colour" may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate by further exploring how the world model is computed.
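The two ideas named above, a linear probe for "my colour" vs. "opponent's colour" and control through vector arithmetic, can be illustrated with a toy sketch. This is not the procedure used in this work: the activations are synthetic, and `d_model`, `true_direction`, and `flip_square` are hypothetical names introduced only for illustration. The probe is an ordinary least-squares fit, and the "intervention" reflects an activation across the probe's decision boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16      # hypothetical activation width (assumption)
n_samples = 400   # number of synthetic activations (assumption)

# Synthetic stand-in for model activations: a single hidden direction
# encodes whether a square holds "my" piece (+1) or the opponent's (-1).
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
labels = rng.choice([-1.0, 1.0], size=n_samples)
noise = rng.normal(scale=0.1, size=(n_samples, d_model))
activations = noise + labels[:, None] * true_direction

# Linear probe: least-squares map from activations to the +1/-1 label.
probe, *_ = np.linalg.lstsq(activations, labels, rcond=None)
accuracy = np.mean(np.sign(activations @ probe) == labels)


def flip_square(x, probe_vec):
    """Reflect x across the probe's decision boundary.

    The component of x along the probe direction is negated, so the
    probe's readout for x changes sign; all other components are kept.
    """
    unit = probe_vec / np.linalg.norm(probe_vec)
    return x - 2.0 * (x @ unit) * unit


x = activations[0]
x_flipped = flip_square(x, probe)
print("probe accuracy:", accuracy)
print("readout before/after:", x @ probe, x_flipped @ probe)
```

In a real network the activations would come from a chosen layer of the trained model, and whether the edited activation changes the model's predicted moves would have to be checked empirically.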