We introduce the Graded Transformer framework, a novel class of sequence models that embeds algebraic inductive biases through grading transformations on vector spaces. Extending the theory of Graded Neural Networks (GNNs), we propose two architectures: the Linearly Graded Transformer (LGT) and the Exponentially Graded Transformer (EGT). These models apply parameterized scaling operators, governed by fixed or learnable grading tuples (and, for EGT, exponential factors), to infuse hierarchical structure into attention and representation layers, enhancing efficiency for structured data. We derive rigorous theoretical guarantees, including universal approximation theorems for continuous and Sobolev functions, reduced sample complexity via effective VC dimension bounds, Lipschitz continuity of graded operations, and robustness to adversarial perturbations. A graded loss function ensures gradient stability and alignment with domain priors during optimization. By treating grades as differentiable parameters, the framework enables adaptive feature prioritization, overcoming limitations of fixed grades in prior work. The Graded Transformer holds transformative potential for hierarchical learning and neurosymbolic reasoning, with applications spanning algebraic geometry (e.g., moduli spaces and zeta functions), physics (e.g., multiscale simulations), natural language processing (e.g., syntactic parsing), biological sequence analysis (e.g., variant prediction), and emerging areas like graph neural networks and financial modeling. This work advances structured deep learning by fusing geometric and algebraic principles with attention mechanisms, offering a mathematically grounded alternative to purely data-driven models and paving the way for interpretable, efficient systems in complex domains.
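To make the grading idea concrete, the following is a minimal PyTorch-style sketch of how a grading tuple might rescale feature coordinates before a standard attention layer. The class names, the placeholder grading tuple, and the exponential form exp(alpha * g_i) are illustrative assumptions for this sketch, not the paper's exact operators.

```python
import torch
import torch.nn as nn

class GradedScaling(nn.Module):
    """Scale each feature coordinate by a grade-dependent factor.

    mode="linear":       x_i -> g_i * x_i                 (LGT-style scaling)
    mode="exponential":  x_i -> exp(alpha * g_i) * x_i    (EGT-style scaling)

    The grading tuple g and the exponential factor alpha are treated as
    differentiable parameters when learnable=True (illustrative choice).
    """
    def __init__(self, dim, mode="linear", learnable=True):
        super().__init__()
        grades = torch.linspace(1.0, 2.0, dim)  # placeholder grading tuple
        self.grades = nn.Parameter(grades, requires_grad=learnable)
        self.alpha = nn.Parameter(torch.tensor(0.5), requires_grad=learnable)
        self.mode = mode

    def forward(self, x):
        if self.mode == "exponential":
            scale = torch.exp(self.alpha * self.grades)
        else:
            scale = self.grades
        return x * scale  # broadcasts over (batch, seq, dim)

class GradedSelfAttention(nn.Module):
    """Standard self-attention preceded by a graded scaling of the inputs."""
    def __init__(self, dim, num_heads=4, mode="linear"):
        super().__init__()
        self.grading = GradedScaling(dim, mode=mode)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        g = self.grading(x)
        out, _ = self.attn(g, g, g)
        return out

# Usage: a toy batch of 8 sequences, length 16, model dimension 32.
x = torch.randn(8, 16, 32)
layer = GradedSelfAttention(dim=32, mode="exponential")
print(layer(x).shape)  # torch.Size([8, 16, 32])
```

Because the grades enter only as multiplicative factors, they remain differentiable and can be optimized jointly with the attention weights, which is the adaptive feature prioritization described above.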