Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle guiding the choice. The result is systematic structural redundancy -- between half and four-fifths of the heads in a trained model can be removed without measurable loss -- because the architecture allocates capacity without reference to the actual requirements of the task.

This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no separate validation phase and no hand-tuned schedule.

Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) gives a geometric upper bound on the number of heads this configuration can contain, as a function of the spectral complexity of the task.

Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using three to seven times fewer parameters and no pre-training.
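To make the growth mechanism concrete, the following is a minimal sketch of the kind of "uncaptured directional energy" criterion the abstract alludes to. The specific formula used here (projecting a running covariance of task gradient directions onto the span of the current heads' subspaces), the function name `uncaptured_energy`, and the threshold `tau` are illustrative assumptions, not the paper's actual definition of its geometric quantity.

```python
# Illustrative sketch of an online-computable growth criterion of the
# kind described in the abstract. The projection-based formula and the
# threshold tau are assumptions for exposition, not the paper's method.
import torch

def uncaptured_energy(grad_cov: torch.Tensor, head_dirs: torch.Tensor) -> float:
    """Fraction of directional energy the current heads fail to capture.

    grad_cov  : (d, d) PSD covariance of task gradient directions,
                assumed maintained as a running average during training.
    head_dirs : (k, d) matrix whose rows span the subspaces the current
                k heads attend to (orthonormalized below).
    """
    # Orthonormalize head directions so the projector is well defined.
    q, _ = torch.linalg.qr(head_dirs.T)      # (d, k), orthonormal columns
    projector = q @ q.T                      # (d, d) projection onto span
    total = torch.trace(grad_cov)
    captured = torch.trace(projector @ grad_cov)
    return float((total - captured) / total)

# Toy usage: starting from a single head, grow one head at a time while
# the residual directional energy exceeds the threshold tau.
d, tau = 64, 0.05
grad_cov = torch.randn(d, d)
grad_cov = grad_cov @ grad_cov.T / d         # random PSD covariance
heads = torch.randn(1, d)                    # start from a single head
while uncaptured_energy(grad_cov, heads) > tau:
    heads = torch.cat([heads, torch.randn(1, d)])  # add one head
print(f"stopped with {heads.shape[0]} heads")
```

The loop necessarily terminates because each added direction enlarges the captured subspace, so the residual energy is driven to zero; the paper's homeostatic convergence theorem makes the analogous guarantee for the full training dynamics, including pruning.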