Transformers have been essential to pretraining success in NLP. Other architectures have been tried, but their downstream accuracy is either significantly worse or matches standard benchmarks such as GLUE only once attention layers are added. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS matches BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the two models have similar average accuracy, BiGS has different inductive biases than BERT in terms of interactions and syntactic representations. All models from this work are available at https://github.com/jxiw/BiGS.
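To make the idea of a static, attention-free mixing layer with multiplicative gating concrete, the following is a minimal sketch and not the BiGS layer itself. It assumes a single scalar channel, a hand-picked linear recurrence, one set of SSM parameters shared by both directions, and a tanh-approximated GELU; the function names (`ssm_scan`, `bidirectional_ssm`, `gated_block`) and the parameters `a`, `b`, `c`, `w_gate`, `w_in` are hypothetical and are not taken from the paper or its repository.

```python
import math


def ssm_scan(u, a, b, c):
    """Scalar linear state-space recurrence (illustrative assumption):
    x_k = a * x_{k-1} + b * u_k,  y_k = c * x_k, starting from x = 0."""
    x = 0.0
    ys = []
    for u_k in u:
        x = a * x + b * u_k
        ys.append(c * x)
    return ys


def bidirectional_ssm(u, a, b, c):
    """Run the same recurrence left-to-right and right-to-left and sum the two.
    Sharing (a, b, c) across directions is a simplification for this sketch."""
    fwd = ssm_scan(u, a, b, c)
    bwd = ssm_scan(u[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


def gelu(z):
    """tanh approximation of GELU."""
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))


def gated_block(u, w_gate, w_in, a, b, c):
    """Multiplicative gating: a per-position gate scales the SSM-mixed sequence.
    The gate looks only at its own position, and the SSM mixes positions with
    fixed weights, so no pairwise (query-key style) scores are computed."""
    gate = [gelu(w_gate * u_k) for u_k in u]
    mixed = bidirectional_ssm([gelu(w_in * u_k) for u_k in u], a, b, c)
    return [g * m for g, m in zip(gate, mixed)]


# Impulse response: how much a single token at position 0 reaches each position.
impulse = bidirectional_ssm([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)
```

In this sketch the impulse response depends only on the distance between positions and on the learned parameters, not on the content of the other tokens, which is one way to read "static layers that do not consider pair-wise interactions". Position 0 receives the impulse from both directions, so its weight is doubled here; a real layer may handle the current token differently.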