Transformer attention computes a single softmax-weighted average over values, a one-pass estimate that cannot correct its own errors. We introduce \emph{gradient-boosted attention}, which applies the principle of gradient boosting \emph{within} a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass acting as a base learner and the per-dimension gate acting as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information that is inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of $67.9$, compared to $72.2$ for standard attention, $69.6$ for Twicing Attention, and $69.0$ for a parameter-matched wider baseline. Two boosting rounds capture most of the benefit.
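The two-pass construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the reconstruction target is taken to be the layer input, so the prediction error is \texttt{X - y1} (under a squared loss, the negative gradient with respect to the prediction is this residual, up to a constant factor); the correction pass builds its queries, keys, and values from that residual; the gate is a sigmoid over one learned logit per dimension; and causal masking, multiple heads, and output projections are omitted. The names \texttt{attention}, \texttt{boosted\_attention}, \texttt{params1}, \texttt{params2}, and \texttt{gate\_logits} are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q_in, kv_in, Wq, Wk, Wv):
    """Single-head scaled dot-product attention (no mask, no output projection)."""
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V


def boosted_attention(X, params1, params2, gate_logits):
    """Two-round sketch: first pass, then a gated correction fit to its error.

    X           : (n, d) input sequence
    params1     : (Wq, Wk, Wv) for the first pass, each (d, d)
    params2     : separate (Wq, Wk, Wv) for the correction pass, each (d, d)
    gate_logits : (d,) per-dimension gate logits (assumed parameterization)
    """
    # Round 1: ordinary attention, playing the role of the first base learner.
    y1 = attention(X, X, *params1)

    # Prediction error of round 1. ASSUMPTION: the target is the input X.
    residual = X - y1

    # Round 2: a second pass with its own projections attends to the residual.
    correction = attention(residual, residual, *params2)

    # Per-dimension gate in (0, 1), playing the role of the shrinkage parameter.
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    return y1 + gate * correction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 6, 8
    X = rng.normal(size=(n, d))
    params1 = tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    params2 = tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    out = boosted_attention(X, params1, params2, gate_logits=np.zeros(d))
    print(out.shape)
```

In this sketch, driving the gate logits strongly negative closes the gate and recovers the first pass unchanged, which mirrors a shrinkage parameter of zero in gradient boosting. Reusing \texttt{params1} for the correction pass would give a shared-projection variant in the spirit of twicing, though the abstract does not specify that baseline at this level of detail.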