Recent research suggests that looped Transformers have superior reasoning capabilities compared to standard deep architectures. However, current approaches to training single-head looped architectures on benchmark tasks frequently fail or yield suboptimal performance due to a highly non-convex and irregular loss landscape. In these settings, optimization often stagnates at poor local minima and saddle points, preventing the model from reaching the global minimum. The internal mechanisms of single-head looped Transformers remain poorly understood, and training them from scratch remains a significant challenge. In this paper, we propose a novel training framework that leverages Tsallis entropy and Hamiltonian dynamics to reshape the geometry of the loss landscape. By treating parameter updates as a physical flow, we successfully train a single-head looped Transformer with model dimension $d = 8$ to solve the induction head task at an input sequence length of 1000 tokens. This success reveals the internal mechanism behind the superior reasoning capability of looped Transformers.
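The abstract's core idea, treating parameter updates as a physical flow regularized by Tsallis entropy, can be illustrated with a minimal sketch. The code below is not the paper's actual framework: it applies damped leapfrog integration of Hamiltonian dynamics to a toy objective (a quadratic bowl plus a Tsallis-entropy term on a softmax of the parameters), with numerical gradients for simplicity. All function names, the toy loss, and the hyperparameters (`q`, `lam`, `dt`, `gamma`) are illustrative assumptions.

```python
import numpy as np

def tsallis_entropy(p, q=1.5):
    # S_q(p) = (1 - sum_i p_i^q) / (q - 1); recovers Shannon entropy as q -> 1
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def softmax(x):
    z = np.exp(x - x.max())  # shift by max for numerical stability
    return z / z.sum()

def loss(theta, lam=0.1, q=1.5):
    # toy objective: quadratic bowl minus a Tsallis-entropy bonus on softmax(theta)
    return np.sum(theta ** 2) - lam * tsallis_entropy(softmax(theta), q)

def num_grad(f, theta, eps=1e-5):
    # central finite differences, to keep the sketch free of autodiff machinery
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
    return g

def hamiltonian_descent(theta, steps=200, dt=0.05, gamma=0.9):
    # damped leapfrog integration of H(theta, p) = loss(theta) + |p|^2 / 2;
    # the friction factor gamma dissipates energy so the flow settles in a minimum
    p = np.zeros_like(theta)
    for _ in range(steps):
        p = gamma * p - 0.5 * dt * num_grad(loss, theta)  # half-step on momentum
        theta = theta + dt * p                            # full step on position
        p = gamma * p - 0.5 * dt * num_grad(loss, theta)  # second half-step
    return theta

theta0 = np.array([2.0, -1.5, 0.7])
theta_star = hamiltonian_descent(theta0)
```

The entropy term flattens sharp basins by rewarding more uniform softmax distributions, which is one way a Tsallis-style regularizer can smooth an irregular landscape; the momentum carried by the Hamiltonian flow helps the iterate coast past shallow saddle regions where plain gradient descent would stall.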