Transformer-based models generate hidden states that are difficult to interpret. In this work, we aim to interpret these hidden states and control them at inference, with a focus on motion forecasting. We use linear probes to measure neural collapse towards interpretable motion features in hidden states. High probing accuracy implies meaningful directions and distances between the hidden states of opposing features, which we use to fit interpretable control vectors for activation steering at inference. To optimize our control vectors, we use sparse autoencoders with fully-connected, convolutional, and MLPMixer layers, and various activation functions. Notably, we show that enforcing sparsity in hidden states leads to a more linear relationship between control vector temperatures and forecasts. Our approach enables mechanistic interpretability and zero-shot generalization to unseen dataset characteristics with negligible computational overhead. Our implementation is available at https://github.com/kit-mrt/future-motion
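The core mechanism can be illustrated with a minimal sketch: a control vector is taken as the (normalized) difference between mean hidden states of two opposing feature groups, and steering adds this vector, scaled by a temperature, to a hidden state at inference. All function and variable names here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def fit_control_vector(pos_hidden, neg_hidden):
    """Fit a control vector as the normalized difference between the
    mean hidden states of two opposing feature groups
    (hypothetical helper; groups could be e.g. left vs. right turns)."""
    v = pos_hidden.mean(axis=0) - neg_hidden.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, control_vector, temperature):
    """Activation steering: shift a hidden state along the control
    vector, scaled by a temperature, at inference time."""
    return hidden + temperature * control_vector

# Synthetic hidden states separated along a known direction.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
pos = rng.normal(size=(32, d)) + direction  # states with the feature
neg = rng.normal(size=(32, d)) - direction  # states with the opposite feature

v = fit_control_vector(pos, neg)
h = rng.normal(size=d)
steered = steer(h, v, temperature=2.0)
# Steering increases the projection of the state onto the control direction.
assert steered @ v > h @ v
```

Because the control vector is unit-normalized, the temperature directly sets how far the hidden state moves along the steering direction; the paper's observation is that sparsity in the hidden states makes the resulting change in forecasts scale more linearly with this temperature.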