Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data, and the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result is a negative one on expressivity: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.
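To make the term "single-layer, linear self-attention" concrete, the following is a minimal sketch of one softmax-free attention layer with a residual connection, in a parameterization that is common in the in-context learning literature. The names `linear_self_attention`, `Z`, `W_kq`, and `W_pv`, as well as the 1/n normalization, are assumptions chosen for illustration and are not taken from this paper. The sketch does not reproduce the paper's latent factor model or its linearized cross-attention mechanism.

```python
import numpy as np


def linear_self_attention(Z, W_kq, W_pv):
    """One layer of linear (softmax-free) self-attention with a residual.

    Z     : (d, n) array, n context tokens stored as columns of dimension d.
    W_kq  : (d, d) merged key-query matrix (hypothetical name).
    W_pv  : (d, d) merged projection-value matrix (hypothetical name).
    """
    n = Z.shape[1]
    # Attention scores are the raw bilinear form Z^T W_kq Z; no softmax is applied.
    scores = Z.T @ W_kq @ Z  # shape (n, n)
    # Each output token is its input plus a score-weighted sum of value vectors,
    # averaged over the context length n (an assumed normalization).
    return Z + (W_pv @ Z @ scores) / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 4, 6
    Z = rng.normal(size=(d, n))
    out = linear_self_attention(Z, np.eye(d), np.eye(d))
    print(out.shape)
```

Because the softmax is dropped, the attention term is a cubic polynomial in the entries of `Z`, which is what makes layers of this kind amenable to exact analysis.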