Deep neural networks based on linear complex-valued RNNs interleaved with position-wise MLPs are gaining traction as competitive approaches to sequence modeling. Examples of such architectures include state-space models (SSMs) like S4, LRU, and Mamba: recently proposed models that achieve promising performance on text, genetics, and other data that require long-range reasoning. Despite experimental evidence highlighting these architectures' effectiveness and computational efficiency, their expressive power remains relatively unexplored, especially in connection to specific choices crucial in practice - e.g., carefully designed initialization distribution and use of complex numbers. In this paper, we show that combining MLPs with both real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps. At the heart of our proof, we rely on a separation of concerns: the linear RNN provides a lossless encoding of the input sequence, and the MLP performs non-linear processing on this encoding. While we show that using real diagonal linear recurrences is enough to achieve universality in this architecture, we prove that employing complex eigenvalues near unit disk - i.e., empirically the most successful strategy in SSMs - greatly helps the RNN in storing information. We connect this finding with the vanishing gradient issue and provide experimental evidence supporting our claims.
翻译:基于线性复值RNN与位置级MLP交织的深度神经网络,正逐渐成为序列建模领域的竞争性方法。此类架构的实例包括状态空间模型(SSM),如S4、LRU和Mamba——这些近期提出的模型在需要长程推理的文本、基因组学等数据上展现出优异性能。尽管实验证据凸显了这些架构的有效性和计算效率,但其表达能力仍相对未得到充分探索,尤其是在实践中至关重要的一些特定选择方面——例如精心设计的初始化分布和复数的使用。本文证明,将MLP与实部或复部线性对角递归相结合,能够以任意精度逼近正则因果序列到序列映射。我们证明的核心依赖于关注点分离原则:线性RNN提供输入序列的无损编码,而MLP则对该编码进行非线性处理。虽然我们证明使用实对角线性递归已足以实现该架构的普适性,但进一步证明,采用接近单位盘的复特征值——即SSM中最成功的经验策略——能极大帮助RNN存储信息。我们将这一发现与梯度消失问题相联系,并提供实验证据支持我们的主张。