I show that ordinary least squares (OLS) predictions can be rewritten as the output of a restricted attention module, akin to those forming the backbone of large language models. This connection offers an alternative perspective on attention beyond the conventional information retrieval framework, making it more accessible to researchers and analysts with a background in traditional statistics. The correspondence falls into place once OLS is framed as a similarity-based method in a transformed regressor space, distinct from the standard view based on partial correlations. In fact, the OLS solution can be recast as the outcome of an alternative problem: minimizing squared prediction errors by optimizing the embedding space in which training and test vectors are compared via inner products. Rather than estimating coefficients directly, one equivalently learns optimal encoding and decoding operations for the predictors. From this vantage point, OLS maps naturally onto the query-key-value structure of attention mechanisms. Building on this foundation, I discuss key elements of Transformer-style attention and draw connections to classic ideas from time series econometrics.
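As a concrete illustration of this rewriting, the minimal sketch below (my own numerical check, not drawn from the paper; the names X, y, x_star and the specific embedding W = (X'X)^{-1/2} are illustrative assumptions) verifies that the OLS prediction at a test point coincides with an unnormalized, softmax-free attention over the training targets once both the query and the keys are embedded by W.

```python
# Minimal sketch (assumptions, not the paper's notation): OLS prediction at a
# test point equals a linear, unnormalized attention over training targets,
# with queries and keys given by the regressors embedded via W = (X'X)^{-1/2}.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))          # training regressors
y = rng.normal(size=n)               # training targets (the "values")
x_star = rng.normal(size=p)          # test regressor (input to the "query")

# Standard OLS prediction: x_star' (X'X)^{-1} X' y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
pred_ols = x_star @ beta_hat

# Attention-style rewrite: embed with the symmetric inverse square root
# W = (X'X)^{-1/2}, score the query against each key by an inner product,
# and aggregate the values y_i with those scores (no softmax).
eigval, eigvec = np.linalg.eigh(X.T @ X)
W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # (X'X)^{-1/2}
q = W @ x_star                                     # query
K = X @ W                                          # keys, one row per training point
weights = K @ q                                    # similarity scores
pred_attn = weights @ y                            # weighted sum of values

assert np.isclose(pred_ols, pred_attn)
```

The identity holds because the scores sum to x_i' W W x_star with W W = (X'X)^{-1}, so the weighted sum of the y_i reproduces x_star' (X'X)^{-1} X' y exactly; any embedding whose Gram matrix equals (X'X)^{-1} would serve equally well.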