Block-Recurrent Dynamics in Vision Transformers

As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest a few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale experiments, we demonstrate that stochastic depth and training promote recurrent structure, which in turn correlates with how accurately Raptor can be fit. We then provide an empirical existence proof for BRH by training a Raptor model that recovers $96\%$ of DINOv2 ImageNet-1k linear probe accuracy with only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low-rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find that a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
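
To make the block-recurrent depth structure concrete, the following is a minimal sketch, not the Raptor implementation. It assumes a toy residual update in place of a real Transformer block, and the names `make_block` and `block_recurrent_forward`, the depth $L = 12$, and the 7/5 split between the two blocks are all hypothetical choices made for illustration. It only shows how $k$ distinct blocks applied recurrently trace out an $L$-step trajectory.

```python
import numpy as np


def make_block(dim, rng, scale=0.1):
    """Hypothetical stand-in for a Transformer block: a residual tanh update."""
    weight = rng.normal(scale=scale / np.sqrt(dim), size=(dim, dim))

    def block(x):
        # Residual form x + f(x), mirroring the skip connection of a ViT block.
        return x + np.tanh(x @ weight)

    return block


def block_recurrent_forward(x, blocks, repeats):
    """Apply blocks[i] repeats[i] times in order and return every depth state."""
    if len(blocks) != len(repeats):
        raise ValueError("blocks and repeats must have the same length")
    states = [x]
    for block, n_steps in zip(blocks, repeats):
        for _ in range(n_steps):
            x = block(x)  # the same weights are reused at each step of a phase
            states.append(x)
    return states


rng = np.random.default_rng(0)
dim, n_tokens = 16, 5
blocks = [make_block(dim, rng) for _ in range(2)]  # k = 2 distinct blocks
repeats = [7, 5]                                   # assumed phases, L = 12 steps
x0 = rng.normal(size=(n_tokens, dim))
states = block_recurrent_forward(x0, blocks, repeats)
```

Here `states` holds the input plus one state per recurrent step, so a trajectory of this kind could be compared layer by layer against the hidden states of an $L$-block model. In practice the phase boundaries would be chosen from the between-layer similarity structure rather than fixed by hand.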
