We study in-context learning for nonparametric regression with $\alpha$-Hölder smooth regression functions, for some $\alpha>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $\Theta(\log n)$ parameters and $\Omega\bigl(n^{2\alpha/(2\alpha+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence $O\bigl(n^{-2\alpha/(2\alpha+d)}\bigr)$ in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. This is achieved by showing that transformers can efficiently approximate local polynomial estimators by implementing a kernel-weighted polynomial basis and then running gradient descent.
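The abstract does not display the estimator itself; as a minimal sketch, the textbook local polynomial estimator it invokes (with a kernel $K$, a bandwidth $h$, and multi-indices $k$ of order at most $\lfloor\alpha\rfloor$ as assumed notation, none of which appear in the abstract) fits, at a query point $x$, the kernel-weighted least-squares objective
\[
\hat\beta(x) \in \operatorname*{arg\,min}_{\beta}\; \sum_{i=1}^{n} K\!\Bigl(\tfrac{X_i - x}{h}\Bigr)\Bigl(Y_i - \sum_{|k|\le\lfloor\alpha\rfloor}\beta_k\,(X_i - x)^{k}\Bigr)^{2},
\qquad \hat f(x) = \hat\beta_{0}(x),
\]
where a bandwidth of order $h \asymp n^{-1/(2\alpha+d)}$ yields the $n^{-2\alpha/(2\alpha+d)}$ rate. On this reading, the transformer's in-context computation corresponds to forming the kernel-weighted polynomial features $(X_i-x)^{k}$ and then minimizing this quadratic objective by gradient descent.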