We study in-context learning for nonparametric regression with $\alpha$-Hölder smooth regression functions, for some $\alpha>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $\Theta(\log n)$ parameters and $\Omega\bigl(n^{2\alpha/(2\alpha+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence $O\bigl(n^{-2\alpha/(2\alpha+d)}\bigr)$ in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. We achieve this by showing that transformers can efficiently approximate local polynomial estimators: they implement a kernel-weighted polynomial basis and then run gradient descent.
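For readers unfamiliar with the classical estimator that the transformer is shown to approximate, the following is a minimal sketch of a local polynomial estimator computed by gradient descent on a kernel-weighted least-squares objective. It is an illustration only, not the transformer construction from the paper: the function name `local_poly_gd`, the Epanechnikov-type kernel, the restriction to $d=1$, the step-size rule, and the number of steps are all assumptions made here. In the standard theory the bandwidth is taken of order $n^{-1/(2\alpha+d)}$, which is what yields the rate $n^{-2\alpha/(2\alpha+d)}$.

```python
import numpy as np


def local_poly_gd(x, y, x0, h, degree=1, steps=500):
    """Estimate f(x0) from 1-D data (x, y) by kernel-weighted polynomial
    least squares, minimized with plain gradient descent.

    Hypothetical helper for illustration; not taken from the paper.
    """
    u = (x - x0) / h
    # Epanechnikov-type kernel weights (unnormalized); zero outside |u| < 1.
    w = np.maximum(1.0 - u**2, 0.0)
    if w.sum() == 0.0:
        raise ValueError("no observations fall inside the kernel window")

    # Polynomial basis 1, u, ..., u^degree evaluated at each observation.
    B = np.vander(u, degree + 1, increasing=True)

    # Objective: (1 / 2n) * sum_i w_i * (y_i - B_i @ theta)^2,
    # whose gradient is A @ theta - b.
    n = len(x)
    A = B.T @ (w[:, None] * B) / n
    b = B.T @ (w * y) / n

    # Step size 1/L, with L the largest eigenvalue of the Hessian A.
    lr = 1.0 / np.linalg.eigvalsh(A)[-1]

    theta = np.zeros(degree + 1)
    for _ in range(steps):
        theta -= lr * (A @ theta - b)

    # The intercept is the fitted value at u = 0, i.e. at x = x0.
    return theta[0]


# Small demonstration on synthetic data (assumed setup, for illustration).
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 1.0, size=2000)
y_demo = np.sin(2 * np.pi * x_demo) + 0.1 * rng.standard_normal(2000)
est_demo = local_poly_gd(x_demo, y_demo, x0=0.5, h=0.1)
print("estimate of f(0.5):", est_demo)
```

The polynomial degree would typically be chosen as the largest integer strictly below $\alpha$; the default `degree=1` (local linear regression) is only a convenient choice for this sketch.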