While looped transformers (termed Looped-Attn) often outperform standard transformers (termed Single-Attn) on complex reasoning tasks, the mechanism behind this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, inspired by empirical observations of their distinct dynamics at both the sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards River-V-Valley. This inductive bias suggests better loss convergence along the river via valley hopping, and further encourages the learning of complex patterns, compared to the River-U-Valley induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a principled training strategy that accelerates the training of Looped-Attn while achieving comparable performance.
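To make the U-versus-V distinction concrete, a minimal toy parameterization (our own illustration, not a definition from the abstract) writes the loss near the river as a separable function of a river coordinate $r$ (the slow descent direction) and a valley coordinate $v$ (the high-curvature cross-valley direction):

\[
% Toy river-valley losses (illustrative assumption only).
% r: coordinate along the river; v: coordinate across the valley; 0 < \epsilon \ll 1.
\mathcal{L}_{\mathrm{U}}(r, v) \;=\; \underbrace{\epsilon\, r}_{\text{slow drift along the river}} \;+\; \underbrace{v^{4}}_{\text{flat, U-shaped cross-section}},
\qquad
\mathcal{L}_{\mathrm{V}}(r, v) \;=\; \epsilon\, r \;+\; \underbrace{|v|}_{\text{steep, V-shaped cross-section}}.
\]

Under this toy picture, near the valley floor ($v \approx 0$) the cross-valley gradient of the U-shaped loss vanishes rapidly, whereas the V-shaped loss keeps a constant-magnitude cross-valley gradient; a finite gradient step can therefore repeatedly hop across a V-shaped valley while still making progress along the river, which is one way to read the valley-hopping intuition stated above.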