When trained on tasks that require an understanding of hierarchical structure, transformers have been found to represent this hierarchy in two distinct ways: in the geometry of the residual stream, and in stack-like attention patterns that maintain a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this question in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
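To make the hierarchical variables concrete, the following minimal sketch computes per-position ground-truth labels for a Dyck sequence: nesting depth, the index of the current top-of-stack bracket, and the distance back to it. The two-bracket vocabulary, the function name `dyck_labels`, and the exact label definitions (measured after each token is read) are illustrative assumptions, not necessarily the setup used in the experiments.

```python
from typing import List, Optional, Tuple

# Assumed bracket vocabulary (Dyck-2); the actual vocabulary may differ.
PAIRS = {"(": ")", "[": "]"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}

Label = Tuple[int, Optional[int], Optional[int]]


def dyck_labels(tokens: str) -> List[Label]:
    """Return (depth, top_of_stack_index, distance) after reading each token.

    depth: number of currently unmatched opening brackets.
    top_of_stack_index: position of the most recent unmatched opening
        bracket, or None when the stack is empty.
    distance: current position minus top_of_stack_index, or None.
    """
    stack: List[int] = []  # positions of unmatched opening brackets (LIFO)
    labels: List[Label] = []
    for i, tok in enumerate(tokens):
        if tok in PAIRS:
            stack.append(i)
        elif tok in CLOSERS:
            # A closing bracket must match the bracket on top of the stack.
            if not stack or tokens[stack[-1]] != CLOSERS[tok]:
                raise ValueError(f"unbalanced bracket at position {i}")
            stack.pop()
        else:
            raise ValueError(f"unknown token {tok!r} at position {i}")
        top = stack[-1] if stack else None
        dist = i - top if top is not None else None
        labels.append((len(stack), top, dist))
    return labels


if __name__ == "__main__":
    for pos, label in enumerate(dyck_labels("([])")):
        print(pos, label)
```

Labels of this kind could serve as probe targets (is the variable decodable?) and as the positions to mask in an attention intervention (is the variable causally used?).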