Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle guiding the choice. The result is systematic structural redundancy -- between half and four-fifths of the heads in a trained model can be removed without measurable loss -- because the architecture allocates capacity without reference to the actual requirements of the task.

This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no separate validation phase and no hand-tuned schedule.

Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) gives a geometric upper bound on the number of heads this configuration can contain, as a function of the spectral complexity of the task.

Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using three to seven times fewer parameters and no pre-training.
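To make the growth mechanism concrete, the following is a minimal sketch of the kind of "uncaptured directional energy" criterion the abstract alludes to. The specific formula used here (projecting a running covariance of task gradient directions onto the span of the current heads' subspaces), the function name `uncaptured_energy`, and the threshold `tau` are illustrative assumptions, not the paper's actual definition of its geometric quantity.

```python
# Illustrative sketch of an online-computable growth criterion of the
# kind described in the abstract. The projection-based formula and the
# threshold tau are assumptions for exposition, not the paper's method.
import torch

def uncaptured_energy(grad_cov: torch.Tensor, head_dirs: torch.Tensor) -> float:
    """Fraction of directional energy the current heads fail to capture.

    grad_cov  : (d, d) PSD covariance of task gradient directions,
                assumed maintained as a running average during training.
    head_dirs : (k, d) matrix whose rows span the subspaces the current
                k heads attend to (orthonormalized below).
    """
    # Orthonormalize head directions so the projector is well defined.
    q, _ = torch.linalg.qr(head_dirs.T)      # (d, k), orthonormal columns
    projector = q @ q.T                      # (d, d) projection onto span
    total = torch.trace(grad_cov)
    captured = torch.trace(projector @ grad_cov)
    return float((total - captured) / total)

# Toy usage: starting from a single head, grow one head at a time while
# the residual directional energy exceeds the threshold tau.
d, tau = 64, 0.05
grad_cov = torch.randn(d, d)
grad_cov = grad_cov @ grad_cov.T / d         # random PSD covariance
heads = torch.randn(1, d)                    # start from a single head
while uncaptured_energy(grad_cov, heads) > tau:
    heads = torch.cat([heads, torch.randn(1, d)])  # add one head
print(f"stopped with {heads.shape[0]} heads")
```

The loop necessarily terminates because each added direction enlarges the captured subspace, so the residual energy is driven to zero; the paper's homeostatic convergence theorem makes the analogous guarantee for the full training dynamics, including pruning.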