While looped transformers (termed Looped-Attn) often outperform standard transformers (termed Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, motivated by empirical observations of the two architectures' distinct training dynamics at both the sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on these empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias toward a River-V-Valley geometry. Under this inductive bias, our theoretical derivations guarantee better loss convergence along the river via valley hopping, and further encourage the learning of complex patterns, compared with the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training of Looped-Attn while achieving comparable performance.
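To make the architectural distinction concrete, the sketch below contrasts a standard transformer, which stacks distinct blocks, with a looped transformer, which recursively applies one shared block. This is a minimal illustration only; the class names (`SingleAttn`, `LoopedAttn`, `Block`), hyperparameters, and pre-norm layout are assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch (not from the paper): Single-Attn vs. Looped-Attn forward pass.
import torch
import torch.nn as nn


class Block(nn.Module):
    """One pre-norm transformer block: self-attention followed by an MLP."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class SingleAttn(nn.Module):
    """Standard transformer: `depth` distinct blocks, each applied once."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, depth: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x


class LoopedAttn(nn.Module):
    """Looped transformer: one shared block applied recursively `n_loops` times."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, n_loops: int = 6):
        super().__init__()
        self.block = Block(d_model, n_heads)
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):
            x = self.block(x)  # weights are shared across loop iterations
        return x


if __name__ == "__main__":
    x = torch.randn(2, 10, 64)  # (batch, sequence, d_model)
    print(SingleAttn()(x).shape, LoopedAttn()(x).shape)
```

At equal effective depth, the looped variant reuses one set of block parameters across iterations; it is this recursion that the abstract conjectures biases the loss landscape toward a River-V-Valley geometry.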