I show that ordinary least squares (OLS) predictions can be rewritten as the output of a restricted attention module, akin to those forming the backbone of large language models. This connection offers an alternative perspective on attention beyond the conventional information retrieval framework, making it more accessible to researchers and analysts with a background in traditional statistics. The correspondence falls into place once OLS is framed as a similarity-based method in a transformed regressor space, distinct from the standard view based on partial correlations. In fact, the OLS solution can be recast as the outcome of an alternative problem: minimizing squared prediction errors by optimizing the embedding space in which training and test vectors are compared via inner products. Rather than estimating coefficients directly, one equivalently learns optimal encoding and decoding operations for the predictors. From this vantage point, OLS maps naturally onto the query-key-value structure of attention mechanisms. Building on this foundation, I discuss key elements of Transformer-style attention and draw connections to classic ideas from time series econometrics.
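As a concrete illustration of this rewriting, the minimal sketch below (my own numerical check, not drawn from the paper; the names X, y, x_star and the specific embedding W = (X'X)^{-1/2} are illustrative assumptions) verifies that the OLS prediction at a test point coincides with an unnormalized, softmax-free attention over the training targets once both the query and the keys are embedded by W.

```python
# Minimal sketch (assumptions, not the paper's notation): OLS prediction at a
# test point equals a linear, unnormalized attention over training targets,
# with queries and keys given by the regressors embedded via W = (X'X)^{-1/2}.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))          # training regressors
y = rng.normal(size=n)               # training targets (the "values")
x_star = rng.normal(size=p)          # test regressor (input to the "query")

# Standard OLS prediction: x_star' (X'X)^{-1} X' y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
pred_ols = x_star @ beta_hat

# Attention-style rewrite: embed with the symmetric inverse square root
# W = (X'X)^{-1/2}, score the query against each key by an inner product,
# and aggregate the values y_i with those scores (no softmax).
eigval, eigvec = np.linalg.eigh(X.T @ X)
W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # (X'X)^{-1/2}
q = W @ x_star                                     # query
K = X @ W                                          # keys, one row per training point
weights = K @ q                                    # similarity scores
pred_attn = weights @ y                            # weighted sum of values

assert np.isclose(pred_ols, pred_attn)
```

The identity holds because the scores sum to x_i' W W x_star with W W = (X'X)^{-1}, so the weighted sum of the y_i reproduces x_star' (X'X)^{-1} X' y exactly; any embedding whose Gram matrix equals (X'X)^{-1} would serve equally well.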