MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining

Although BERT-style encoder models are heavily used in NLP research, many researchers do not pretrain their own BERTs from scratch because of the high cost of training. In the half-decade since BERT first rose to prominence, many advances have been made with other transformer architectures and training configurations that have yet to be systematically incorporated into BERT. Here, we introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining. This efficient architecture incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), a module to dynamically remove padded tokens, and low-precision LayerNorm into the classic transformer encoder block. The training recipe includes a 30% masking ratio for the Masked Language Modeling (MLM) objective, bfloat16 precision, and a vocabulary size optimized for GPU throughput, in addition to best practices from RoBERTa and other encoder models. When pretrained from scratch on the C4 dataset, the base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed Pareto curves and show that MosaicBERT base and large are consistently Pareto optimal when compared to a competitive BERT base and large. This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models. We open source our model weights and code.
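To make two of the ingredients above concrete, the following is a minimal sketch of a 30% MLM masking step and of removing padded tokens from a batch. It is an illustration under stated assumptions, not the released MosaicBERT code: the function names `mask_tokens` and `unpad`, the token ids, and the use of -100 as the "ignore" label are all hypothetical choices made here, and every selected position is simply replaced by the mask token (the usual BERT 80/10/10 replacement rule is left out for brevity).

```python
import random

MASK_RATIO = 0.30  # masking ratio stated in the abstract


def mask_tokens(token_ids, mask_id, special_ids, ratio=MASK_RATIO, rng=None):
    """Mask a fraction `ratio` of the non-special tokens.

    Returns (masked_ids, labels). Labels hold the original token at masked
    positions and -100 elsewhere (an assumed "ignore this position" marker).
    Simplification: every chosen position becomes `mask_id`.
    """
    rng = rng or random.Random()
    # Only ordinary tokens are candidates; special tokens are never masked.
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    n_mask = round(len(candidates) * ratio)
    chosen = set(rng.sample(candidates, n_mask))
    masked = [mask_id if i in chosen else t for i, t in enumerate(token_ids)]
    labels = [t if i in chosen else -100 for i, t in enumerate(token_ids)]
    return masked, labels


def unpad(batch, pad_id):
    """Drop padding tokens from a padded batch.

    Returns a flat list of the real tokens plus the per-sequence lengths,
    so that no compute is spent on padding positions.
    """
    flat, lengths = [], []
    for seq in batch:
        kept = [t for t in seq if t != pad_id]
        flat.extend(kept)
        lengths.append(len(kept))
    return flat, lengths


# Hypothetical ids: 101/102 as sequence delimiters, 103 as mask, 0 as padding.
tokens = [101] + list(range(1000, 1010)) + [102]
masked, labels = mask_tokens(tokens, mask_id=103, special_ids={0, 101, 102},
                             rng=random.Random(0))
flat, lengths = unpad([[5, 6, 0, 0], [7, 8, 9, 0]], pad_id=0)
```

With ten ordinary tokens in `tokens`, three positions are masked, and `unpad` keeps five of the eight batch entries. Returning the sequence lengths alongside the flat token list is one way to let later layers recover sequence boundaries; how the actual module tracks them is not specified in the abstract.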
