Bidirectional attention, composed of self-attention with positional encodings and the masked language model (MLM) objective, has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of MoE in bidirectional attention, which aligns with its practical effectiveness in handling heterogeneous data. It also suggests an immediate extension to categorical tabular data, if we view each word location in a sentence as a tabular feature. Across empirical studies, we find that this extension outperforms existing tabular extensions of transformers in out-of-distribution (OOD) generalization. Finally, this statistical perspective of bidirectional attention enables us to theoretically characterize when linear word analogies are present in its word embeddings. These analyses show that bidirectional attention can require much stronger assumptions to exhibit linear word analogies than its non-attention predecessors.
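To make the "CBOW with MoE weights" reading concrete, the following is a minimal NumPy sketch, not the paper's exact reparameterization. It contrasts classic CBOW, which averages context embeddings uniformly, with a variant whose averaging weights come from a softmax gate over the context words. All names here (`E_in`, `E_out`, `P`, `mask_vec`, `W_gate`, and both function names) are hypothetical, and the bilinear gating score is an assumed stand-in for a single attention head's query-key product.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def cbow_probs(context_ids, E_in, E_out):
    """Classic CBOW: uniform average of context embeddings, then softmax."""
    h = E_in[context_ids].mean(axis=0)
    return softmax(E_out @ h)


def moe_cbow_probs(context_ids, context_pos, mask_pos,
                   E_in, E_out, P, mask_vec, W_gate):
    """CBOW with gated (MoE-style) weights; an illustrative assumption.

    Each context word plays the role of an expert. Its weight is a softmax
    over bilinear scores between the masked position and that context word,
    mimicking one attention head with positional encodings.
    """
    query = mask_vec + P[mask_pos]              # shape (d,)
    keys = E_in[context_ids] + P[context_pos]   # shape (n, d)
    weights = softmax(keys @ W_gate @ query)    # shape (n,), sums to 1
    h = weights @ E_in[context_ids]             # weighted average, shape (d,)
    return softmax(E_out @ h), weights


# Toy setup: vocabulary of 20 words, embedding size 8, sentence length 5.
rng = np.random.default_rng(0)
V, d, L = 20, 8, 5
E_in = rng.normal(size=(V, d))    # input (context) word embeddings
E_out = rng.normal(size=(V, d))   # output (target) word embeddings
P = rng.normal(size=(L, d))       # positional encodings
mask_vec = rng.normal(size=d)     # embedding of the mask token

context_ids = np.array([3, 7, 11, 2])   # observed words
context_pos = np.array([0, 1, 3, 4])    # their positions
mask_pos = 2                            # position of the masked word

W_gate = rng.normal(size=(d, d))
probs, weights = moe_cbow_probs(context_ids, context_pos, mask_pos,
                                E_in, E_out, P, mask_vec, W_gate)

# With a zero gate every expert gets equal weight, recovering plain CBOW.
probs_uniform, weights_uniform = moe_cbow_probs(
    context_ids, context_pos, mask_pos,
    E_in, E_out, P, mask_vec, np.zeros((d, d)))
baseline = cbow_probs(context_ids, E_in, E_out)
```

In this sketch, setting the gate to zero collapses the weights to a uniform average, so the gated model contains classic CBOW as a special case. That is one way to read the abstract's contrast between bidirectional attention and its non-attention predecessors.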