Self-attention encoders such as Bidirectional Encoder Representations from Transformers (BERT) scale quadratically with sequence length, making long-context modeling expensive. Linear-time state space models such as Mamba are efficient, but they struggle to model global interactions and can suffer from padding-induced state contamination. We propose MaBERT, a hybrid encoder that interleaves Transformer layers for global dependency modeling with Mamba layers for linear-time state updates. This design alternates global contextual integration with fast state accumulation, enabling efficient training and inference on long inputs. To stabilize variable-length batching, we introduce padding-safe masking, which blocks state propagation through padded positions, and mask-aware attention pooling, which aggregates information only from valid tokens. On GLUE, MaBERT achieves the best mean score on five of the eight tasks, with strong performance on CoLA and the sentence-pair inference tasks. When extending the context from 512 to 4,096 tokens, MaBERT reduces training time and inference latency by 2.36x and 2.43x, respectively, relative to the average of encoder baselines, demonstrating a practical and efficient long-context encoder.
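The two stabilization mechanisms can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a toy scalar recurrence `h_t = a*h_{t-1} + b*x_t` standing in for the Mamba state update, and a single learned scoring vector `w` for the pooling head; both parameterizations are hypothetical. The key ideas match the abstract: padded steps never update the recurrent state, and pooling weights are computed only over valid tokens.

```python
import numpy as np

def padding_safe_scan(x, mask, a, b):
    """Toy scalar state update h_t = a*h_{t-1} + b*x_t.

    The state is frozen at padded positions (mask == 0), so padding
    tokens cannot contaminate the accumulated state.
    """
    h = 0.0
    out = []
    for xt, mt in zip(x, mask):
        if mt:  # update the state only on valid tokens
            h = a * h + b * xt
        out.append(h)
    return out

def mask_aware_attention_pool(hidden, mask, w):
    """Pool token states (T, d) into one vector using only valid tokens.

    hidden: (T, d) token representations
    mask:   (T,) with 1 for valid tokens, 0 for padding
    w:      (d,) scoring vector (hypothetical parameterization)
    """
    valid = mask.astype(bool)
    scores = hidden @ w                          # (T,) attention logits
    scores = np.where(valid, scores, -np.inf)    # block padded positions
    weights = np.exp(scores - scores[valid].max())  # stable softmax numerator
    weights = weights * mask                     # pads receive weight 0
    weights = weights / weights.sum()
    return weights @ hidden                      # (d,) pooled representation
```

Because padded positions get exactly zero pooling weight and never enter the state update, the outputs for a padded batch element agree with those for the same sequence processed without padding, which is what makes variable-length batching stable.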