Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long-range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispense with the discretization step and propose a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously proposed SSMs on a variety of tasks and benchmarks, including Long Range Arena and raw speech classification. Moreover, we characterize the expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs achieve near-perfect performance on tasks that can be modeled via $\textit{few}$ convolutional kernels, they struggle on tasks requiring $\textit{many}$ such kernels, especially when the desired sequence manipulation is $\textit{context-dependent}$. Despite these limitations, $\mathrm{DLR}$ reaches high performance on two higher-order reasoning tasks, $\mathrm{ListOpsSubTrees}$ and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$, with input lengths $8K$ and $65K$, respectively, and gives encouraging performance on $\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$, for which attention is not a viable choice.
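To make the link between a diagonal linear RNN and a convolutional kernel concrete, the following is a minimal sketch in plain Python. The abstract does not spell out the recurrence, so the form used here is an assumption: an elementwise state update $x_k = \lambda \odot x_{k-1} + u_k$ with complex diagonal entries $\lambda_i$, and a readout $y_k = \mathrm{Re}\left(\sum_i w_i x_{k,i}\right)$. The function names (`dlr_recurrent`, `dlr_kernel`, `dlr_convolutional`) and the example values are hypothetical and chosen only for illustration. Under this assumption, unrolling the recurrence gives a causal convolution of the input with the single kernel $K_j = \sum_i w_i \lambda_i^j$, which is the sense in which one such channel corresponds to one convolutional kernel.

```python
import cmath


def dlr_recurrent(u, lam, w):
    """Run the assumed diagonal linear recurrence step by step.

    Assumed form (not given in the abstract):
        x_k = lam * x_{k-1} + u_k   (elementwise over state dimensions)
        y_k = Re(sum_i w_i * x_k[i])
    """
    x = [0j] * len(lam)  # state starts at zero
    ys = []
    for u_k in u:
        x = [l * x_i + u_k for l, x_i in zip(lam, x)]
        ys.append(sum(w_i * x_i for w_i, x_i in zip(w, x)).real)
    return ys


def dlr_kernel(lam, w, length):
    """Kernel obtained by unrolling the recurrence: K[j] = sum_i w_i * lam_i**j."""
    return [sum(w_i * l ** j for w_i, l in zip(w, lam)) for j in range(length)]


def dlr_convolutional(u, lam, w):
    """Same outputs as dlr_recurrent, computed as a causal convolution with K."""
    K = dlr_kernel(lam, w, len(u))
    return [
        sum(K[j] * u[k - j] for j in range(k + 1)).real
        for k in range(len(u))
    ]


# Hypothetical parameters: two complex eigenvalues inside the unit disk.
lam = [0.9 * cmath.exp(0.3j), 0.5 * cmath.exp(-1.2j)]
w = [1.0 + 0.5j, -0.3 + 2.0j]
u = [1.0, 0.0, -2.0, 0.5, 3.0]  # real-valued input sequence

y_rec = dlr_recurrent(u, lam, w)
y_conv = dlr_convolutional(u, lam, w)
```

The two computations agree up to floating-point error. The direct convolution here is quadratic in the sequence length and is meant only to show the equivalence; it is not a claim about how the kernel is applied in practice.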