Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks and, unlike the existing literature, consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$. This additional uncertainty-quantification objective provides a handle to (i) better design out-of-distribution experiments that distinguish ICL from in-weight learning (IWL) and (ii) better separate algorithms that use prior information about the training distribution from those that do not. Theoretically, we show that the trained Transformer reaches near Bayes-optimality, suggesting that it uses the information of the training distribution. Our analysis also extends to more general settings. Specifically, with the Transformer's context window $S$, we prove a generalization bound of $\tilde{\mathcal{O}}(\sqrt{\min\{S, T\}/(n T)})$ over $n$ tasks with sequences of length $T$, a sharper analysis than the previous result of $\tilde{\mathcal{O}}(\sqrt{1/n})$. Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution in distribution, a natural consequence of supervised training, it does not necessarily perform Bayesian inference when facing task shifts, in contrast to the \textit{equivalence} between the two proposed in much of the existing literature. We also demonstrate the trained Transformer's ICL ability under covariate shift and prompt-length shift and interpret both as generalization over a meta distribution.
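To make the Bayes-optimal baseline concrete, the following is a minimal sketch of the posterior predictive mean and variance for Bayesian linear regression with a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$ and observation noise $\mathcal{N}(0, \sigma^2)$, i.e., the kind of bi-objective prediction (conditional mean and conditional variance) the trained Transformer is compared against. The function name and the values of `sigma2` and `tau2` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bayes_predict(X_ctx, y_ctx, x_query, sigma2=0.25, tau2=1.0):
    """Posterior predictive for Bayesian linear regression.

    Prior:  w ~ N(0, tau2 * I);  likelihood:  y = w @ x + N(0, sigma2).
    Returns (E[Y | x_query, context], Var(Y | x_query, context)).
    Hyperparameters sigma2/tau2 are illustrative, not from the paper.
    """
    d = x_query.shape[0]
    # Posterior covariance of w: (X^T X / sigma2 + I / tau2)^{-1}
    S = np.linalg.inv(X_ctx.T @ X_ctx / sigma2 + np.eye(d) / tau2)
    mu = S @ X_ctx.T @ y_ctx / sigma2      # posterior mean of w
    mean = x_query @ mu                    # conditional expectation
    var = x_query @ S @ x_query + sigma2   # epistemic + noise variance
    return mean, var

# Usage: one in-context linear regression task.
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
X = rng.normal(size=(200, d))
y = X @ w_true + rng.normal(scale=0.5, size=200)
x_q = rng.normal(size=d)
mean_long, var_long = bayes_predict(X, y, x_q)
mean_short, var_short = bayes_predict(X[:2], y[:2], x_q)
```

Note that the predictive variance never drops below the noise floor `sigma2`, and it shrinks as the context (prompt) grows longer, which is what makes the variance head a useful probe for distinguishing Bayes-like in-context behavior from other algorithms.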