We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks, a setting captured by the recently introduced attention-indexed model. Using tools from random matrix theory, spin-glass physics, and approximate message passing, we derive sharp asymptotics for the training and test errors, locate the interpolation and recovery thresholds, and characterize the limiting spectral distribution of the learned weights. Weight decay induces an implicit nuclear-norm regularization that favors low-rank query and key matrices. Leveraging this, we compare the standard factorized training of the query and key matrices with a direct parameterization in which their product is trained element-wise, revealing the inductive bias introduced by the factorized form. Remarkably, the predicted spectral distribution echoes empirical trends reported in large-scale transformers, offering a theoretical perspective consistent with these observations.
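The implicit nuclear-norm regularization mentioned above rests on a classical identity: over all factorizations $W = AB$, the minimum of $\tfrac{1}{2}(\|A\|_F^2 + \|B\|_F^2)$ equals the nuclear norm $\|W\|_*$, attained by the balanced factorization built from the SVD of $W$. The sketch below is a minimal numerical check of this attainment (it is illustrative only, not the paper's code; the matrix shapes and rank are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# A rank-3 "product" matrix, standing in for the query-key product Q K^T
W = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))

# Nuclear norm = sum of singular values
U, s, Vt = np.linalg.svd(W)
nuclear = s.sum()

# Balanced factorization A = U sqrt(S), B = sqrt(S) V^T recovers W ...
A = U * np.sqrt(s)          # scale columns of U by sqrt(singular values)
B = np.sqrt(s)[:, None] * Vt  # scale rows of V^T by sqrt(singular values)
assert np.allclose(A @ B, W)

# ... and its Frobenius weight-decay penalty equals the nuclear norm,
# since each factor contributes sum_i s_i to the squared Frobenius norm.
penalty = 0.5 * (np.linalg.norm(A, "fro") ** 2 + np.linalg.norm(B, "fro") ** 2)
assert np.isclose(penalty, nuclear)
```

This is why penalizing $\|Q\|_F^2 + \|K\|_F^2$ during factorized training acts, at the level of the product, like a nuclear-norm penalty, biasing the learned attention matrix toward low rank, whereas element-wise training of the product with the same weight decay penalizes its Frobenius norm instead.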