Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind their non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address these issues, we introduce the Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH resolves both pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-6.9B), MeSH-enhanced recursive transformers consistently improve over recursive baselines, and at the 1.4B scale the MeSH model outperforms its larger non-recursive counterpart, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models. Our code is available at https://github.com/LivingFutureLab/MeSH/ .
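To make the idea of an explicit memory buffer with per-iteration routers concrete, the following is a minimal illustrative sketch, not the released implementation. Everything in it is an assumption made for illustration: the names `shared_core`, `recursive_forward`, `read_logits`, and `write_logits` are hypothetical, the shared core is reduced to a single `tanh` layer, the write is additive, and the routers are static per-iteration weights rather than input-dependent ones.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix(weights, slots):
    """Weighted sum of memory slots; each slot is a list of floats."""
    dim = len(slots[0])
    return [sum(w * slot[i] for w, slot in zip(weights, slots)) for i in range(dim)]


def shared_core(core_w, x):
    """Toy stand-in for the shared (reused) transformer block: tanh(W x)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in core_w]


def recursive_forward(x, core_w, read_logits, write_logits, num_slots):
    """Run the shared core once per iteration, routing through a memory buffer.

    Assumption: slot 0 is initialised with the input and the rest with zeros.
    read_logits[t] and write_logits[t] are the (hypothetical) router
    parameters for iteration t, one logit per memory slot.
    """
    memory = [list(x)] + [[0.0] * len(x) for _ in range(num_slots - 1)]
    for t in range(len(read_logits)):
        # Read router: iteration-specific mixture of slots forms the core input.
        core_in = mix(softmax(read_logits[t]), memory)
        h = shared_core(core_w, core_in)
        # Write router: distribute the core output back across the slots.
        write_w = softmax(write_logits[t])
        memory = [
            [m + w * v for m, v in zip(slot, h)]
            for w, slot in zip(write_w, memory)
        ]
    return memory


# Example: 3 slots, 2 iterations, hidden size 2, identity-like core weights.
x = [0.5, -0.25]
core_w = [[1.0, 0.0], [0.0, 1.0]]
read_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]   # each iteration reads differently
write_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]  # and writes to a different slot
memory = recursive_forward(x, core_w, read_logits, write_logits, num_slots=3)
```

Because the read and write weights differ per iteration, the same core parameters see different inputs and deposit their outputs in different slots, which is the kind of cross-iteration differentiation the abstract describes; long-lived information can stay in one slot while transient results land in another.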