Stability and robustness are critical for deploying Transformers in safety-sensitive settings. A principled way to enforce such behavior is to constrain the model's Lipschitz constant. However, approximation-theoretic guarantees for architectures that explicitly preserve Lipschitz continuity have yet to be established. In this work, we bridge this gap by introducing a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction. We realize both MLP and attention blocks as explicit Euler steps of negative gradient flows, ensuring inherent stability without sacrificing expressivity. We prove a universal approximation theorem for this class within a Lipschitz-constrained function space. Crucially, our analysis adopts a measure-theoretic formalism, interpreting Transformers as operators on probability measures, to yield approximation guarantees independent of token count. These results provide a rigorous theoretical foundation for the design of robust, Lipschitz-continuous Transformer architectures.
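To make the construction concrete, the following is a minimal sketch in illustrative notation (the learned potential $V_\theta$ and step size $\tau$ are assumptions for exposition, not the paper's exact parameterization). A block realized as an explicit Euler step of the negative gradient flow $\dot{x} = -\nabla V_\theta(x)$ updates a token representation via

$$
x^{(k+1)} = x^{(k)} - \tau\,\nabla V_\theta\!\left(x^{(k)}\right).
$$

If $\nabla V_\theta$ is $L$-Lipschitz, the triangle inequality gives $\|x^{(k+1)} - y^{(k+1)}\| \le (1+\tau L)\,\|x^{(k)} - y^{(k)}\|$, so each step, and hence any finite composition of blocks, admits an explicit Lipschitz bound by construction. In the same spirit, the measure-theoretic viewpoint can be sketched by letting a block act on the empirical measure $\mu = \tfrac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$ of the tokens rather than on the ordered token tuple, which is what allows approximation guarantees that do not depend on the token count $n$.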