Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for a mixture context $\nu = I^{-1} \sum_{i=1}^{I} \mu^{(i)}$ and a query $x_{\mathrm{q}}$ associated with a target component $i^*$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)}, x_{\mathrm{q}})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving that the convergence rate is sharp. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.
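To make the measure-level view concrete, one natural way to write softmax attention as an integral operator on a context measure $\nu$ is the normalized kernel average below; the parameterization via query, key, and value matrices $W_Q, W_K, W_V$ is illustrative notation for this sketch rather than the paper's specific construction:
$$
\mathrm{Attn}_{\theta}(x_{\mathrm{q}};\,\nu)
\;=\;
\frac{\int \exp\!\big(\langle W_Q x_{\mathrm{q}},\, W_K x\rangle\big)\, W_V x \,\mathrm{d}\nu(x)}
     {\int \exp\!\big(\langle W_Q x_{\mathrm{q}},\, W_K x'\rangle\big)\,\mathrm{d}\nu(x')},
\qquad \theta = (W_Q, W_K, W_V).
$$
When $\nu$ is an empirical measure over a finite token sequence, this expression reduces to standard softmax attention; for a mixture context $\nu = I^{-1}\sum_{i=1}^{I}\mu^{(i)}$, the exponential weighting concentrates mass on the component best aligned with $x_{\mathrm{q}}$, which is the mechanism behind the recall step (i).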