Transformer-based models generate hidden states that are difficult to interpret. In this work, we analyze hidden states and modify them at inference, with a focus on motion forecasting. We use linear probing to test whether interpretable features are embedded in hidden states. Our experiments reveal high probing accuracy, indicating latent-space regularities with functionally important directions. Building on this, we use the directions between hidden states with opposing features to fit control vectors. At inference, we add our control vectors to hidden states and evaluate their impact on predictions. Remarkably, such modifications preserve the feasibility of predictions. We further refine our control vectors using sparse autoencoders (SAEs), which leads to more linear changes in predictions when the control vectors are scaled. Our approach enables mechanistic interpretation as well as zero-shot generalization to unseen dataset characteristics with negligible computational overhead.
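To make the probing and steering steps concrete, the following is a minimal sketch, assuming hidden states have already been collected from a chosen transformer layer. The function names, the difference-of-means construction, and the scaling factor alpha are illustrative assumptions, not the paper's exact implementation.

```python
import torch
from sklearn.linear_model import LogisticRegression

def probe_accuracy(hidden_states, labels):
    """Linear probe (sketch): fit a logistic-regression classifier on hidden
    states and report accuracy as a rough indicator of how linearly decodable
    the feature is. hidden_states: (N, d) tensor, labels: (N,) tensor."""
    X = hidden_states.detach().cpu().numpy()
    y = labels.detach().cpu().numpy()
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X, y)

def fit_control_vector(hidden_pos, hidden_neg):
    """Control vector (sketch): direction between hidden states with opposing
    features, taken here as the difference of group means.
    hidden_pos, hidden_neg: tensors of shape (N, d) and (M, d)."""
    return hidden_pos.mean(dim=0) - hidden_neg.mean(dim=0)

def steer(hidden_states, control_vector, alpha=1.0):
    """At inference, add the scaled control vector to the hidden states of a
    chosen layer; alpha controls the strength of the intervention."""
    return hidden_states + alpha * control_vector
```

A hypothetical usage pattern would be to split scenes by an interpretable attribute, fit the control vector on their hidden states, and then add it (with varying alpha) to the layer output during forecasting.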