Large-scale language model pretraining is a highly successful form of self-supervised learning in natural language processing, but it has become increasingly expensive as models and pretraining corpora have grown. We propose NarrowBERT, a modified transformer encoder that increases the throughput of masked language model pretraining by more than $2\times$. NarrowBERT sparsifies the transformer model so that the self-attention queries and feedforward layers operate only on the masked tokens of each sentence during pretraining, rather than on all of the tokens as in the usual transformer encoder. We also show that NarrowBERT increases throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks such as MNLI. Finally, we examine the performance of NarrowBERT on the IMDB and Amazon reviews classification tasks and the CoNLL NER task, and show that it is also comparable to that of standard BERT.
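The narrowing idea described in the abstract can be illustrated with a minimal single-head attention sketch in NumPy. This is not the NarrowBERT implementation: the function names, shapes, random weights, and masked positions below are hypothetical, and the sketch assumes only what the abstract states, namely that queries are restricted to the masked positions while keys and values still span the whole sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, wq, wk, wv):
    # Single-head scaled dot-product attention.
    # Queries come from q_in; keys and values come from kv_in.
    q = q_in @ wq
    k = kv_in @ wk
    v = kv_in @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Hypothetical toy setup: 8 tokens, hidden size 4, two masked positions.
rng = np.random.default_rng(0)
n_tokens, d = 8, 4
x = rng.normal(size=(n_tokens, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
masked = np.array([2, 5])

# Standard encoder: every token issues a query.
full = attention(x, x, wq, wk, wv)            # shape (8, 4)

# Narrowed variant: only the masked tokens issue queries,
# but they still attend over all 8 tokens.
narrow = attention(x[masked], x, wq, wk, wv)  # shape (2, 4)

# Each query row is computed independently, so the narrowed outputs
# match the corresponding rows of the full computation.
print(np.allclose(narrow, full[masked]))
```

In this sketch the narrowed call produces one output row per masked token instead of one per token, so any position-wise feedforward layer applied afterwards would also run on fewer rows. That is the intuition behind the throughput gain, under the assumptions stated above.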