Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the query projection $W_Q$ can be set to the identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q$, $XW_K$, and $XW_V$, so a change of basis can be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \R^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small-style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with $12.5\%$ more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
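As an illustration of the residual query map $Q(X) = X + f_\theta(X)$, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a bottleneck width of $d/2$, which is one way to reach $d^2 + O(d)$ parameters (two weight matrices of $d \cdot d/2$ entries each, plus biases). The GELU activation, the zero initialization of the output projection, and the names `init_params`, `query`, and `num_params` are all hypothetical choices made for this example.

```python
import numpy as np

def init_params(d, rng):
    """Parameters of a hypothetical bottleneck MLP f_theta: d -> d/2 -> d."""
    r = d // 2  # assumed bottleneck width, so that 2*d*r = d^2 weights
    return {
        "W1": rng.normal(0.0, d ** -0.5, size=(d, r)),
        "b1": np.zeros(r),
        # Zero init is an assumption: it makes Q(X) = X exactly at the start,
        # i.e. the model begins at the identity query projection.
        "W2": np.zeros((r, d)),
        "b2": np.zeros(d),
    }

def gelu(x):
    # tanh approximation of GELU; the choice of activation is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def query(X, p):
    """Compute Q(X) = X + f_theta(X) for X of shape (tokens, d)."""
    h = gelu(X @ p["W1"] + p["b1"])   # down-projection and nonlinearity
    return X + h @ p["W2"] + p["b2"]  # up-projection plus identity term

def num_params(p):
    """Total number of scalar parameters in f_theta."""
    return sum(v.size for v in p.values())

rng = np.random.default_rng(0)
d = 64
params = init_params(d, rng)
X = rng.normal(size=(10, d))
Q = query(X, params)
```

With these choices the parameter count is $d^2 + \tfrac{3}{2}d$, compared with $d^2$ for a bias-free linear $W_Q$, and the map reduces to the identity until the output projection is trained away from zero.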