Complex problems, whether in math, logic, or planning, are solved by humans through a sequence of steps where the result of one step informs the next. In this work, we adopt the perspective that the reasoning power of Transformers is fundamentally limited by a fixed maximum number of steps along any latent path of computation. To address this, we introduce Turbo Connection (TurboConn), a novel architecture that overcomes the fixed-depth constraint by routing multiple residual connections from the higher-layer hidden states of each token $t$ to the lower layers of token $t+1$. Fine-tuning pre-trained LLMs with our method not only yields accuracy gains of 0.9% to over 10% on benchmarks such as GSM8K, Parity, and multi-step arithmetic, but also demonstrates that the density of these backward connections is critical: our dense interaction significantly outperforms "sparse" alternatives that pass only a single hidden state or vector. Notably, TurboConn can be integrated into pre-trained LLMs to overcome task-specific plateaus: while a fine-tuned Qwen-3-1.7B achieves only 53.78% on Parity, adding our architectural modification enables the model to reach 100% accuracy, all without retraining the full model from scratch or resorting to sophisticated curriculum learning. Our results provide strong empirical evidence that the depth of the computational path is a key factor in reasoning ability, and they offer a new mechanism for enhancing LLMs without significantly affecting generation latency.
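The connection pattern described above can be illustrated with a toy sketch. The code below is a minimal, hypothetical rendering of the idea, not the paper's implementation: transformer blocks are stood in for by simple per-layer transforms, and every upper-half layer of token $t$ feeds a learned projection into every lower-half layer of token $t+1$ (our reading of "dense"; the actual wiring, projections, and layer split in TurboConn may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 4  # toy hidden size and layer count

# Stand-ins for transformer blocks: one simple transform per layer.
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(L)]
layer = lambda l, h: np.tanh(h @ Ws[l])

# Hypothetical TurboConn projections, one per (high layer -> low layer) pair.
# "Dense" here means every upper-half layer of token t feeds every
# lower-half layer of token t+1.
high = range(L // 2, L)
low = range(L // 2)
P = {(i, j): rng.standard_normal((d, d)) * 0.1 for i in high for j in low}

def forward(tokens):
    prev_states = None  # per-layer hidden states of the previous token
    outs = []
    for x in tokens:
        h, states = x, []
        for l in range(L):
            if prev_states is not None and l in low:
                # Route residual connections from the previous token's
                # higher-layer hidden states into this lower layer.
                for i in high:
                    h = h + prev_states[i] @ P[(i, l)]
            h = layer(l, h)
            states.append(h)
        prev_states = states
        outs.append(h)
    return outs

outs = forward([rng.standard_normal(d) for _ in range(3)])
print(len(outs), outs[0].shape)
```

Because the extra connections only read states already computed for the previous token, they add no sequential depth per token, which is consistent with the claim that generation latency is largely unaffected.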