Pre-training large language models is known to be extremely resource-intensive and often inefficient, under-utilizing the information encapsulated in the training text sequences. In this paper, we present SpacTor, a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and token replacement detection (RTD), and (2) a two-stage curriculum that optimizes the hybrid objective over the initial $\tau$ iterations, then transitions to the standard SC loss. We show empirically that the effectiveness of the hybrid objective is tied to the two-stage pre-training schedule, and provide an extensive analysis of why this is the case. In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training, while enabling a 50% reduction in pre-training iterations and a 40% reduction in total FLOPs. Alternatively, given the same compute budget, we find that SpacTor results in significantly improved downstream benchmark performance.
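The two-stage curriculum described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the SC and RTD terms are combined, so the weighted sum, the `rtd_weight` parameter, the function name `spactor_loss`, and the zero-indexed step convention below are all assumptions made for illustration, not the authors' implementation.

```python
def spactor_loss(step: int, tau: int, sc_loss: float, rtd_loss: float,
                 rtd_weight: float = 1.0) -> float:
    """Select the training loss for one step of a two-stage schedule.

    Assumptions (not taken from the paper):
      - steps are zero-indexed, so stage one covers steps 0 .. tau - 1;
      - the hybrid objective is a weighted sum SC + rtd_weight * RTD.
    """
    if step < tau:
        # Stage one: hybrid objective (span corruption plus RTD).
        return sc_loss + rtd_weight * rtd_loss
    # Stage two: standard span corruption loss only.
    return sc_loss


# Toy usage with constant per-step loss values and tau = 2.
losses = [spactor_loss(s, tau=2, sc_loss=1.0, rtd_loss=0.5) for s in range(4)]
print(losses)  # [1.5, 1.5, 1.0, 1.0]
```

In a real training loop the two loss values would come from the model's forward pass at each step; the point of the sketch is only the switch at step $\tau$ from the hybrid objective to the SC loss.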