Cross-layer reuse of early attention projections can improve optimization and data efficiency, but it creates a structural conflict: the first layer must simultaneously act as a stable, reusable anchor for all deeper layers and as an effective computational block in its own right. We demonstrate that this tension constrains the performance of internal-anchor designs. We propose ExoFormer, which resolves the conflict by learning exogenous anchor projections outside the sequential layer stack. We introduce a unified normalized mixing framework that combines queries, keys, values, and gate logits using learnable coefficients (exploring three coefficient granularities: elementwise, headwise, and scalar), and we show that normalizing anchor sources is key to stable reuse. ExoFormer variants consistently outperform their internal-anchor counterparts; the dynamic variant gains 1.5 downstream accuracy points and matches the validation loss of Gated Attention using 1.5x fewer tokens. We explain this efficacy via an Offloading Hypothesis: external anchors preserve essential token identity, freeing layers to specialize exclusively in feature transformation. We release code and models to facilitate future research.
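The normalized mixing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the RMS normalization, sigmoid parameterization of the coefficients, and function names (`rms_norm`, `mix_with_anchor`) are assumptions; the paper's exact formulation may differ. The sketch shows how one learnable coefficient tensor can realize all three granularities (scalar, headwise, elementwise) via broadcasting, with both the local projection and the exogenous anchor normalized before mixing.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize along the feature axis; RMS norm is an assumed choice
    # standing in for whatever normalization the paper applies to anchors.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def mix_with_anchor(local_proj, anchor_proj, alpha):
    """Mix a layer's own projection (query/key/value/gate logits) with an
    exogenous anchor projection using a learnable coefficient `alpha`.

    `alpha` may be a scalar, a (n_heads, 1, 1) array, or a full
    elementwise array, corresponding to the three coefficient
    granularities; numpy broadcasting handles all three uniformly.
    """
    a = 1.0 / (1.0 + np.exp(-alpha))  # sigmoid keeps the mix convex
    return a * rms_norm(local_proj) + (1.0 - a) * rms_norm(anchor_proj)

# Toy shapes: (n_heads, seq_len, head_dim).
rng = np.random.default_rng(0)
local = rng.standard_normal((4, 8, 16))
anchor = rng.standard_normal((4, 8, 16))

scalar_mix = mix_with_anchor(local, anchor, alpha=0.0)            # scalar
headwise_mix = mix_with_anchor(local, anchor, np.zeros((4, 1, 1)))  # headwise
elem_mix = mix_with_anchor(local, anchor, np.zeros((4, 8, 16)))     # elementwise
print(scalar_mix.shape)
```

With `alpha = 0` the sigmoid gives an even 0.5/0.5 blend, so all three granularities coincide in this toy; in training, each coefficient would be a learned parameter updated by gradient descent.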