We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for the Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in self-attention layers, which are key components of modern architectures.
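As a point of reference, a minimal single-layer instance of the bilinear interaction described above might take the following form; the notation here ($X$, the link function $\varphi$, the $1/\sqrt{d}$ scaling) is an illustrative assumption, not the paper's exact definition:
\[
Y \;=\; \varphi\!\left(\frac{1}{\sqrt{d}}\, X Q K^{\top} X^{\top}\right) \in \mathbb{R}^{L \times L},
\qquad X \in \mathbb{R}^{L \times d},\quad Q,\, K \in \mathbb{R}^{d \times d},
\]
where $X$ stacks $L$ token embeddings of dimension $d$, $\varphi$ acts row-wise (e.g. a softmax), and the full-width query and key matrices $Q$ and $K$ are the parameters to be learned; in this sketch, learning amounts to recovering the product $QK^{\top}$ from sample pairs $(X_\mu, Y_\mu)$.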