State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on many tasks in vision and audio; however, they still lag behind Transformers on language modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences. We study three different, fully parallelizable variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
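As an illustration only, the following minimal Python sketch shows one way an SSM pass and block-wise attention could be combined so that each block sees a summary of earlier tokens. It is not the BST implementation. The function names, the scalar SSM parameters `a` and `b`, the single-head attention without learned projections, and the choice to hand each block exactly one context state are all assumptions made for this toy example.

```python
import math


def ssm_scan(xs, a=0.9, b=0.1):
    """Toy diagonal linear SSM (assumed form): h_t = a * h_{t-1} + b * x_t per channel."""
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [a * hi + b * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def hybrid_layer(xs, block_len):
    """Hypothetical hybrid layer: causal attention inside each block,
    plus one SSM state summarizing all tokens before the block."""
    states = ssm_scan(xs)
    out = []
    for start in range(0, len(xs), block_len):
        block = xs[start:start + block_len]
        # Long-range context: the SSM state at the last token of the previous block.
        context = [states[start - 1]] if start > 0 else []
        for i, query in enumerate(block):
            # Short-range context: tokens of the same block up to position i.
            memory = context + block[:i + 1]
            out.append(attend(query, memory, memory))
    return out


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    for row in hybrid_layer(tokens, block_len=2):
        print(row)
```

In this sketch, attention never crosses a block boundary, so its cost grows with the block length rather than the full sequence length. Information from earlier blocks reaches later tokens only through the SSM state. Here the SSM is computed with a sequential loop for readability; a linear recurrence of this form can also be evaluated in parallel, which this toy version does not attempt.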