Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks and, unlike the existing literature, consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$. This additional uncertainty-quantification objective provides a handle to (i) better design out-of-distribution experiments that distinguish ICL from in-weight learning (IWL) and (ii) better separate algorithms that use prior information about the training distribution from those that do not. Theoretically, we show that the trained Transformer reaches near Bayes-optimality, suggesting that it uses the information of the training distribution. Our analysis also extends to more general settings. Specifically, with the Transformer's context window $S$, we prove a generalization bound of $\tilde{\mathcal{O}}(\sqrt{\min\{S, T\}/(n T)})$ over $n$ tasks with sequences of length $T$, a sharper analysis than the previous result of $\tilde{\mathcal{O}}(\sqrt{1/n})$. Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution in distribution, a natural consequence of supervised training, it does not necessarily perform Bayesian inference when facing task shifts, in contrast to the \textit{equivalence} between the two proposed in much of the existing literature. We also demonstrate the trained Transformer's ICL ability under covariate shift and prompt-length shift and interpret both as generalization over a meta distribution.
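To make the Bayes-optimal baseline concrete, the following is a minimal sketch of the posterior predictive mean and variance for Bayesian linear regression with a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$ and observation noise $\mathcal{N}(0, \sigma^2)$, i.e., the kind of bi-objective prediction (conditional mean and conditional variance) the trained Transformer is compared against. The function name and the values of `sigma2` and `tau2` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bayes_predict(X_ctx, y_ctx, x_query, sigma2=0.25, tau2=1.0):
    """Posterior predictive for Bayesian linear regression.

    Prior:  w ~ N(0, tau2 * I);  likelihood:  y = w @ x + N(0, sigma2).
    Returns (E[Y | x_query, context], Var(Y | x_query, context)).
    Hyperparameters sigma2/tau2 are illustrative, not from the paper.
    """
    d = x_query.shape[0]
    # Posterior covariance of w: (X^T X / sigma2 + I / tau2)^{-1}
    S = np.linalg.inv(X_ctx.T @ X_ctx / sigma2 + np.eye(d) / tau2)
    mu = S @ X_ctx.T @ y_ctx / sigma2      # posterior mean of w
    mean = x_query @ mu                    # conditional expectation
    var = x_query @ S @ x_query + sigma2   # epistemic + noise variance
    return mean, var

# Usage: one in-context linear regression task.
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
X = rng.normal(size=(200, d))
y = X @ w_true + rng.normal(scale=0.5, size=200)
x_q = rng.normal(size=d)
mean_long, var_long = bayes_predict(X, y, x_q)
mean_short, var_short = bayes_predict(X[:2], y[:2], x_q)
```

Note that the predictive variance never drops below the noise floor `sigma2`, and it shrinks as the context (prompt) grows longer, which is what makes the variance head a useful probe for distinguishing Bayes-like in-context behavior from other algorithms.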