Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16--20% to only 1--3% relative to a standard Transformer backbone.
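To make the two ingredients named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a head's entropy is the mean Shannon entropy of its attention rows, that "informative" heads are simply the top-k by that score, and that looping is enabled one layer at a time from the deepest layer downward on a fixed step interval. The function names and the parameters `k` and `grow_every` are hypothetical and do not come from the abstract.

```python
import math


def head_entropy(attn_rows):
    """Mean Shannon entropy (in nats) of one head's attention rows.

    attn_rows: list of probability rows, each assumed to sum to 1.
    """
    total = 0.0
    for row in attn_rows:
        # Zero-probability entries contribute nothing to the entropy.
        total -= sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn_rows)


def select_high_entropy_heads(attn_by_head, k):
    """Indices of the k heads with the highest mean attention entropy.

    Assumption: top-k selection stands in for whatever criterion SGT uses.
    """
    scores = [head_entropy(rows) for rows in attn_by_head]
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:k])


def layers_with_looping(step, num_layers, grow_every):
    """Hypothetical deep-to-shallow schedule.

    Every `grow_every` training steps, one more layer (counting down from
    the deepest) has attention looping enabled on its selected heads.
    """
    n_active = min(num_layers, step // grow_every)
    return list(range(num_layers - n_active, num_layers))


# Toy example: three heads, two query positions, two key positions.
attn = [
    [[1.0, 0.0], [1.0, 0.0]],  # head 0: fully peaked, entropy 0
    [[0.5, 0.5], [0.5, 0.5]],  # head 1: uniform, entropy ln 2
    [[0.9, 0.1], [0.9, 0.1]],  # head 2: in between
]
looped_heads = select_high_entropy_heads(attn, k=1)
early = layers_with_looping(step=0, num_layers=4, grow_every=100)
middle = layers_with_looping(step=250, num_layers=4, grow_every=100)
late = layers_with_looping(step=10_000, num_layers=4, grow_every=100)
```

In this toy setting only the uniform head is selected, and the set of looped layers grows from empty to the two deepest layers to all four as training proceeds. Because extra depth is applied only to the selected heads in the currently active layers, the added compute stays small relative to looping entire blocks, which is the intuition behind the reported reduction in overhead.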