Transformer architectures can solve unseen tasks from input-output pairs provided in a prompt, a capability known as in-context learning (ICL). Existing theoretical studies of ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, characterizing the loss landscape to reveal the underlying optimization behaviors. Specifically, we (1) provide a closed-form expression for the global minimizer (in an enlarged parameter space) of a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize this optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) offer a novel interpretation of multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the squared loss. These theoretical results are numerically validated using simplified transformers.
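To make the setup concrete, the following is a minimal sketch, not the paper's construction, of a single-layer linear self-attention forward pass on an ICL prompt whose inputs follow a linear dynamical system. The dimensions, the state matrix A, the noise scale, the parameter matrices P and Q, and the merged parameterization f(Z) = Z + (1/n) P Z M (Z^T Q Z) are illustrative assumptions chosen for readability, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper).
d, n = 3, 20          # input dimension, number of in-context examples

# Markovian inputs: x_{t+1} = A x_t + noise (a stable linear dynamical system).
A = 0.8 * np.eye(d) + 0.1 * rng.standard_normal((d, d))
w = rng.standard_normal(d)            # task vector defining y_t = w^T x_t

xs = np.zeros((n + 1, d))
xs[0] = rng.standard_normal(d)
for t in range(n):
    xs[t + 1] = A @ xs[t] + 0.1 * rng.standard_normal(d)
ys = xs @ w                           # labels; the last one is the held-out query target

# Prompt matrix Z of shape (d+1, n+1): columns are (x_t, y_t), with the query label masked to 0.
Z = np.zeros((d + 1, n + 1))
Z[:d, :] = xs.T
Z[d, :n] = ys[:n]

# Single-layer linear self-attention (no softmax), merged parameterization:
# f(Z) = Z + (1/n) * P Z M (Z^T Q Z), where M masks out the query column.
P = 0.1 * rng.standard_normal((d + 1, d + 1))   # value/projection matrix (illustrative init)
Q = 0.1 * rng.standard_normal((d + 1, d + 1))   # merged key-query matrix (illustrative init)
M = np.eye(n + 1)
M[n, n] = 0.0                                   # the query column does not attend to itself

out = Z + (P @ Z @ M @ (Z.T @ Q @ Z)) / n
y_pred = out[d, n]                              # model's prediction for the query label
print("prediction:", y_pred, " target:", ys[n])
```

Training such a model would mean minimizing the squared error between `y_pred` and the true query label over randomly drawn tasks; the abstract's results concern the landscape of that objective under Markovian rather than i.i.d. inputs.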