As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), which argues that trained ViTs admit a block-recurrent depth structure: the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest a few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale experiments, we demonstrate that stochastic depth and training promote recurrent structure, which in turn correlates with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model that recovers $96\%$ of DINOv2's ImageNet-1k linear-probe accuracy with only 2 blocks at equivalent runtime. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins, with trajectories that self-correct under small perturbations; ii) token-specific dynamics, where the cls token executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction; and iii) a collapse to low-rank updates at late depth, consistent with convergence to low-dimensional attractors. Altogether, we find that a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
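To make the block-recurrent rewriting concrete, the following is a minimal sketch, assuming a standard PyTorch encoder layer as a stand-in for a ViT block; the block internals, phase split, and class name (`BlockRecurrentViT`) are illustrative assumptions, not the paper's Raptor implementation:

```python
import torch
import torch.nn as nn

class BlockRecurrentViT(nn.Module):
    """Sketch of the Block-Recurrent Hypothesis: the depth-L computation
    is approximated by k << L distinct blocks, each applied recurrently
    within its contiguous phase (hypothetical layout, not Raptor itself)."""

    def __init__(self, dim: int, depth: int, num_shared_blocks: int):
        super().__init__()
        self.depth = depth  # L: total number of layer applications
        # k distinct blocks replace the L independent blocks of a standard ViT
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(num_shared_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split depth into k contiguous phases; reuse one block per phase
        steps_per_phase = self.depth // len(self.blocks)
        for block in self.blocks:
            for _ in range(steps_per_phase):
                x = block(x)  # weight-tied recurrence along depth
        return x

# e.g. a 12-layer ViT rewritten with k = 2 shared blocks (L = 12, k = 2)
model = BlockRecurrentViT(dim=384, depth=12, num_shared_blocks=2)
tokens = torch.randn(1, 197, 384)  # [batch, cls + patch tokens, dim]
out = model(tokens)
```

Note the runtime equivalence claimed in the abstract: the recurrent surrogate still performs $L$ block applications, so it saves parameters, not FLOPs.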