We study in-context learning problems where a Transformer is pretrained on tasks drawn from a mixture distribution $\pi=\sum_{\alpha\in\mathcal{A}} \lambda_\alpha \pi_\alpha$, called the pretraining prior, in which each mixture component $\pi_\alpha$ is a distribution over tasks of a specific difficulty level indexed by $\alpha$. Our goal is to understand the performance of the pretrained Transformer when evaluated on a different test distribution $\mu$, consisting of tasks of a fixed difficulty $\beta\in\mathcal{A}$, with a potential distribution shift relative to $\pi_\beta$, subject to the chi-squared divergence $\chi^2(\mu,\pi_\beta)$ being at most $\kappa$. In particular, we consider nonparametric regression problems with random smoothness, as well as multi-index models with both random smoothness and random effective dimension. We prove that a large Transformer pretrained on sufficient data achieves the optimal rate of convergence corresponding to the difficulty level $\beta$, uniformly over test distributions $\mu$ in the chi-squared divergence ball. Consequently, the pretrained Transformer achieves faster rates of convergence on easier tasks and is robust to distribution shift at test time. Finally, we prove that even an estimator with access to the test distribution $\mu$ could not attain a faster rate of convergence of the expected risk over $\mu$ than our pretrained Transformer, thereby providing a more appropriate optimality guarantee than standard minimax lower bounds.
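To make the uniform guarantee concrete, it can be summarized as the displayed bound below; this is an illustrative formalization in notation of our own, not taken from the abstract: $\widehat{T}$ denotes the pretrained Transformer, $\mathcal{R}_\mu(\widehat{T})$ its expected risk under a task drawn from $\mu$, $n$ the in-context sample size, and $r(\beta)$ the rate exponent for difficulty level $\beta$.
\[
  \sup_{\mu \,:\, \chi^2(\mu,\pi_\beta) \le \kappa}
    \mathbb{E}\!\left[\mathcal{R}_\mu\big(\widehat{T}\big)\right]
  \;\lesssim\;
  n^{-r(\beta)}.
\]
For instance, in nonparametric regression over $d$-dimensional inputs with H\"older smoothness $\beta$, the classical optimal exponent is $r(\beta)=\tfrac{2\beta}{2\beta+d}$; the final result stated above asserts that no estimator, even one with knowledge of $\mu$, can improve on this rate over the chi-squared ball.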