MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch due to the high cost of training. In the half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low-precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, the base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models. We open source our model weights and code.
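
To make the 30% masking ratio concrete, the following is a minimal sketch of MLM input corruption in plain Python. It is an illustration under stated assumptions, not the MosaicBERT implementation: the function name `mask_tokens`, the `MASK_TOKEN` string, and the use of `None` as an "ignore" label are all hypothetical, and every selected position is simply replaced by the mask symbol (the random-replacement and keep-unchanged variants used in some BERT recipes are omitted for brevity).

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol, chosen for illustration
IGNORE = None          # hypothetical label for positions that are not scored


def mask_tokens(tokens, mask_ratio=0.30, seed=0):
    """Mask a fixed fraction of positions for an MLM-style objective.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at each masked position and IGNORE everywhere else, so a loss would
    only be computed on the masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)  # seeded for reproducibility of the sketch
    # Number of positions to mask: at least one, at most all of them.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else IGNORE for i, t in enumerate(tokens)]
    return masked, labels


if __name__ == "__main__":
    sentence = "researchers can pretrain custom encoders at low cost".split()
    masked, labels = mask_tokens(sentence)
    print(masked)
    print(labels)
```

With `mask_ratio=0.30`, a 20-token sequence has 6 positions masked, compared with 3 under a 15% ratio, so each training sequence contributes more prediction targets per forward pass.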
