Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block. We show that this "first-layer tension" is a hidden limiter of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack, decoupling the anchor role from computational refinement. We introduce a unified normalized mixing framework that mixes queries, keys, values, and gate logits using learnable coefficients (exploring elementwise, headwise, and scalar coefficient granularities), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts, and the dynamic variant gains 1.5 downstream accuracy points over Gated Attention while matching its validation loss with 1.5x fewer tokens. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in refinement. We release code and models to facilitate future research.
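The normalized mixing described above can be sketched as follows. This is a minimal illustration, not the paper's exact rule: the sigmoid gate, the additive blend, and the RMS normalization are assumptions standing in for the unspecified parameterization, and the function names (`rms_normalize`, `mix_projection`) are hypothetical. What it does show is how a single broadcasting-based blend covers all three coefficient granularities (scalar, headwise, elementwise) and where the anchor-source normalization enters.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    # RMS-normalize along the feature (last) axis; the abstract argues that
    # normalizing the anchor source is key to stable cross-layer reuse.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def mix_projection(local, anchor, coeff):
    # Blend a layer's own projection with a normalized exogenous anchor.
    # `coeff` may be a scalar, a per-head array of shape (heads, 1, 1), or a
    # full elementwise array; NumPy broadcasting covers all three
    # granularities. Sigmoid keeps the learnable gate in (0, 1).
    gate = 1.0 / (1.0 + np.exp(-np.asarray(coeff, dtype=float)))
    return (1.0 - gate) * local + gate * rms_normalize(anchor)

# Toy shapes: (heads, tokens, head_dim)
rng = np.random.default_rng(0)
local = rng.normal(size=(4, 8, 16))
anchor = rng.normal(size=(4, 8, 16))

scalar_mix = mix_projection(local, anchor, 0.0)                   # scalar
headwise_mix = mix_projection(local, anchor, np.zeros((4, 1, 1))) # headwise
elem_mix = mix_projection(local, anchor, np.zeros((4, 8, 16)))    # elementwise
```

With zero-initialized coefficients all three granularities coincide (the gate is 0.5 everywhere), which makes a convenient starting point before the coefficients are learned per query, key, value, or gate-logit stream.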