We propose Bayes-optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations that frame Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove that a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistical separation from gradient descent by constructing tasks with temporally correlated noise in which the Bayes-optimal predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be viewed as performing implicit ERM, this shows that selective SSMs achieve lower asymptotic risk through superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm that selective SSMs converge faster to the Bayes-optimal risk, exhibit superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers. This reframes ICL from "implicit optimization" to "optimal inference," explaining the efficiency of selective SSMs and offering a principled basis for architecture design.
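To make the target of the optimality claim concrete, the following is a minimal sketch of the Bayes-optimal predictor in the LG-SSM setting; the parameterization ($A$, $C$, $Q$, $R$) is standard notation introduced here for illustration, not notation fixed by the abstract. Consider an LG-SSM with latent state $x_t$ and observation $y_t$:
\[
x_{t+1} = A x_t + w_t, \quad w_t \sim \mathcal{N}(0, Q), \qquad
y_t = C x_t + v_t, \quad v_t \sim \mathcal{N}(0, R).
\]
Under squared loss, the Bayes-optimal next-step predictor is the posterior predictive mean, which the Kalman filter computes recursively:
\[
\hat{y}_{t+1 \mid t} = \mathbb{E}\!\left[y_{t+1} \mid y_{1:t}\right] = C A\, \hat{x}_{t \mid t}, \qquad
\hat{x}_{t \mid t} = \hat{x}_{t \mid t-1} + K_t \left(y_t - C \hat{x}_{t \mid t-1}\right),
\]
with Kalman gain $K_t = P_{t \mid t-1} C^\top \left(C P_{t \mid t-1} C^\top + R\right)^{-1}$, where $P_{t \mid t-1}$ is the one-step-ahead state covariance. The convergence claim in the abstract asserts that a meta-trained selective SSM asymptotically matches this predictive mean.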