We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size; the rate depends only on the smoothness $\beta$ of the activation and is crucially independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, which holds even when the weight matrix and the activation are not separately identifiable, and they provide a theoretical understanding of the attention mechanism and its training.
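For concreteness, one representative instance of such a model (an illustrative sketch only; the averaging form and the Hölder-smoothness reading of $\beta$ are assumptions, not the paper's exact definitions) maps tokens $x_1,\dots,x_N \in \mathbb{R}^d$ to outputs
\[
f_i(x_1,\dots,x_N) \;=\; \frac{1}{N}\sum_{j=1}^{N}\sigma\!\bigl(x_i^{\top} W x_j\bigr), \qquad i = 1,\dots,N,
\]
where $W$ is the weight matrix and $\sigma$ is an activation of smoothness $\beta$. Under this reading, the minimax rate $M^{-\frac{2\beta}{2\beta+1}}$ for estimating such an interaction from $M$ samples involves none of $N$, $d$, or $\operatorname{rank}(W)$.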