Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into their underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning, and even second-order ones in the case of linear regression. In this work, we study whether Transformers can perform higher-order optimization methods beyond the case of linear regression. We establish that linear attention Transformers with ReLU layers can approximate second-order optimization algorithms for the task of logistic regression and achieve $\epsilon$ error with a number of additional layers that is only logarithmic in the error. As a by-product, we demonstrate that even linear attention-only Transformers can implement a single step of Newton's iteration for matrix inversion with merely two layers. These results suggest that the Transformer architecture can implement complex algorithms beyond gradient descent.
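For readers unfamiliar with Newton's iteration for matrix inversion, the following is a minimal sketch of the classical scheme, assuming the standard update $X_{k+1} = 2X_k - X_k A X_k$ (often called the Newton-Schulz iteration). It illustrates the numerical method only and is not the Transformer construction of this work. The function names and the initialization $X_0 = A^\top / (\|A\|_1 \|A\|_\infty)$ are assumptions made for this illustration.

```python
# Illustrative sketch only: classical Newton iteration for matrix inversion,
# written in pure Python. All function names here are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]

def newton_step(a, x):
    """One Newton step for inverting a: X <- 2X - X A X."""
    xax = matmul(matmul(x, a), x)
    return [[2 * x[i][j] - xax[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]

def newton_inverse(a, steps=20):
    """Approximate the inverse of a square, non-singular matrix a."""
    n = len(a)
    # Assumed initialization: X0 = A^T / (||A||_1 * ||A||_inf), a common
    # choice that keeps the spectral radius of (I - A X0) below one.
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = norm_1 * norm_inf
    x = [[v / scale for v in row] for row in transpose(a)]
    for _ in range(steps):
        x = newton_step(a, x)
    return x

A = [[4.0, 1.0], [2.0, 3.0]]
X = newton_inverse(A)   # approximates the inverse of A
P = matmul(A, X)        # should be close to the identity matrix
```

Under this update the residual satisfies $I - AX_{k+1} = (I - AX_k)^2$, so the error is squared at every step once the iteration is in its convergent regime. For the example matrix above, the iterates approach the exact inverse $\begin{pmatrix} 0.3 & -0.1 \\ -0.2 & 0.4 \end{pmatrix}$.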