This paper investigates approximation-theoretic aspects of the in-context learning capability of transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss uniformly defined across tasks. It shows that transformers of logarithmic depth can achieve error bounds comparable to those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, suggesting a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.
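To make the setting concrete, the sketch below simulates one task from the family the abstract refers to, a noisy linear dynamical system $x_{t+1} = A x_t + w_t$, and computes the least-squares estimate of $A$ from the in-context trajectory. The state dimension, horizon, noise level, and stability rescaling are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative sketch (assumed parameters): one noisy linear dynamical
# system x_{t+1} = A x_t + w_t and the least-squares baseline estimator.
rng = np.random.default_rng(0)
d, T, sigma = 4, 200, 0.1                    # state dim, horizon, noise std (assumptions)

A = rng.standard_normal((d, d))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))    # rescale to spectral radius 0.9 (stable system)

# Roll out the trajectory: the "context" a transformer would condition on.
x = np.zeros((T + 1, d))
x[0] = rng.standard_normal(d)
for t in range(T):
    x[t + 1] = A @ x[t] + sigma * rng.standard_normal(d)

# Least-squares estimate: A_hat = argmin_A sum_t ||x_{t+1} - A x_t||^2.
X, Y = x[:-1], x[1:]                         # regressors x_t and targets x_{t+1}
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T

print("Frobenius estimation error:", np.linalg.norm(A_hat - A, ord="fro"))
```

The trajectory here is non-IID (each $x_{t+1}$ depends on $x_t$), which is the data regime where the paper's second result separates single-layer linear transformers from deeper ones.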