What is the most brute-force way to install interpretable, controllable features into a model's activations? Controlling how LLMs internally represent concepts typically requires sophisticated methods to first identify, then intervene on, the model's existing feature geometry. We bypass all of this. We finetune an LLM with a simple auxiliary loss, training 16 of its 3072 residual stream dimensions to be inert interpretability flags that simply indicate which concepts are required for generation. The model reorganizes around them anyway, learning to rely on these flags during actual generation tasks. As a result, these inert flags become genuine internal features: interpretable control switches that allow us to steer generation at inference time. Why does this work? When a feature is reliably supplied at a fixed location, gradient descent gradually eliminates redundant encodings elsewhere, and the model erodes its own alternative representations. A model's efficiency pressure is a lever: it can be exploited to induce interpretable, controllable representations.
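As a rough illustration, the auxiliary loss described above might look like the following minimal NumPy sketch. The loss form (MSE), the choice of which 16 dimensions to reserve, the layer at which the loss is applied, and the weighting constant are all assumptions for illustration, not details stated in the abstract:

```python
import numpy as np

D_MODEL = 3072             # residual stream width, as stated in the abstract
FLAG_DIMS = np.arange(16)  # hypothetical choice: reserve the first 16 dims as flags

def flag_loss(resid, concept_targets):
    """Auxiliary loss pushing the 16 reserved residual-stream dimensions
    toward binary indicators of which concepts the generation requires.

    resid:           (batch, seq, D_MODEL) residual-stream activations
    concept_targets: (batch, 16) 0/1 vector of required concepts
    """
    flags = resid[:, :, FLAG_DIMS]         # (batch, seq, 16) flag activations
    targets = concept_targets[:, None, :]  # broadcast targets over sequence
    return float(np.mean((flags - targets) ** 2))

# During finetuning, this term would be added to the usual LM objective,
# e.g.  total_loss = lm_loss + aux_weight * flag_loss(resid, concept_targets)
```

At inference time, steering would then amount to writing the desired 0/1 pattern into those 16 dimensions, since the finetuned model has learned to read the concept flags from that fixed location.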