Modern neural networks of the transformer family require the practitioner to decide, before training begins, how many attention heads to use, how deep the network should be, and how wide each component should be. These decisions are made without knowledge of the task, producing architectures that are systematically larger than necessary: empirical studies find that a substantial fraction of heads and layers can be removed after training without performance loss. This paper introduces DDCL-INCRT, an architecture that determines its own structure during training. Two complementary ideas are combined. The first, DDCL (Deep Dual Competitive Learning), replaces the feedforward block with a dictionary of learned prototype vectors representing the most informative directions in the data. The prototypes spread apart automatically, driven by the training objective, without explicit regularisation. The second, INCRT (Incremental Transformer), controls the number of heads: starting from one, it adds a new head only when the directional information uncaptured by existing heads exceeds a threshold. The main theoretical finding is that these two mechanisms reinforce each other: each new head amplifies prototype separation, which in turn raises the signal triggering the next addition. At convergence, the network self-organises into a hierarchy of heads ordered by representational granularity. This hierarchical structure is proved to be unique and minimal, the smallest architecture sufficient for the task, under the stated conditions. Formal guarantees of stability, convergence, and pruning safety are established throughout. The architecture is not something one designs. It is something one derives.
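The growth rule described for INCRT (start from one head, add another only while the uncaptured directional information exceeds a threshold) can be illustrated with a toy sketch. The sketch rests on assumptions that are not taken from the paper: each "head" is reduced to a single unit direction, "uncaptured directional information" is approximated by the share of total energy along the dominant direction of the residual data, and the names `grow_heads`, `top_direction`, `residual`, and `threshold` are hypothetical. It is a stand-in for the idea, not the DDCL-INCRT criterion itself.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual(x, directions):
    """Remove from x its components along each captured (orthonormal) direction."""
    r = list(x)
    for d in directions:
        c = dot(r, d)
        r = [ri - c * di for ri, di in zip(r, d)]
    return r


def top_direction(rows, iters=100):
    """Power iteration on the uncentred second-moment matrix of `rows`.

    Returns (unit direction, energy of the rows along that direction).
    """
    dim = len(rows[0])
    v = [1.0 / dim ** 0.5] * dim
    for _ in range(iters):
        # w = (sum_x x x^T) v, computed without forming the matrix.
        w = [0.0] * dim
        for x in rows:
            c = dot(x, v)
            for i in range(dim):
                w[i] += c * x[i]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            return v, 0.0  # nothing left to capture
        v = [wi / norm for wi in w]
    return v, sum(dot(x, v) ** 2 for x in rows)


def grow_heads(rows, threshold, max_heads=8):
    """Assumed stand-in for the growth rule: one direction per 'head'.

    A head is added only while the strongest residual direction still
    carries more than `threshold` of the total energy.
    """
    total = sum(dot(x, x) for x in rows)
    if total == 0.0:
        return []
    first, _ = top_direction(rows)
    heads = [first]  # start from a single head
    while len(heads) < max_heads:
        res = [residual(x, heads) for x in rows]
        candidate, energy = top_direction(res)
        if energy / total <= threshold:
            break  # uncaptured signal is below the threshold: stop growing
        heads.append(candidate)
    return heads


# Toy data: two informative axes and one that is almost pure noise.
rng = random.Random(0)
data = [[3.0 * rng.gauss(0, 1), rng.gauss(0, 1), 0.01 * rng.gauss(0, 1)]
        for _ in range(200)]

heads = grow_heads(data, threshold=0.02)
print(len(heads))
```

On this toy data the loop should stop after two directions, because the third axis carries a negligible share of the energy. Raising `threshold` to 0.5 should leave only the initial head. The mutual reinforcement between heads and prototypes claimed in the abstract is not modelled here.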