Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, substantially different in spirit from approaches that focus on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and, crucially, captures softmax attention in addition to RNNs, SSMs, and related models. In contrast to new model proposals, which are commonly evaluated only on benchmarks, we derive design principles linking architectural choices to model properties, thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from the recent literature, the framework both explains the empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.
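As a minimal illustration of the formulation summarized above (the symbols $y_t$, $v_s$, $c_{t,s}$, $h_{t,s}$, $A_t$, $b_s$, and $q_t$ are notation assumed here for exposition, not taken from the paper), the output at step $t$ is a linear combination of past value vectors,
$$
y_t = \sum_{s=1}^{t} c_{t,s}\, v_s ,
$$
and, under this reading, each coefficient trajectory $c_{t,s}$ (fixed $s$, varying $t \ge s$) can be viewed as the output of an autonomous linear dynamical system whose state is set by an impulse at time $s$, e.g.
$$
h_{s,s} = b_s, \qquad h_{t,s} = A_t\, h_{t-1,s} \ \ (t > s), \qquad c_{t,s} = q_t^{\top} h_{t,s}.
$$
Different architectures then correspond to different choices of the transition $A_t$, the impulse $b_s$, and the readout $q_t$; this sketch is only meant to convey the general shape of the framework, not its exact definitions.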