A salient characteristic of large pre-trained language models (PTLMs) is that their generalization capability improves markedly, and new capabilities emerge, as model capacity and pre-training dataset size increase. Consequently, we are witnessing the development of enormous models that push the state of the art. This trend, however, leads to prohibitively long training times, extortionate computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, but scant attention has been paid to optimizing the utility of the training data. The key question we ask is whether PTLMs can be trained on only highly informative subsets of the training data while maintaining downstream performance. Building upon recent progress in informative data subset selection, we show how submodular optimization can be employed to select highly representative subsets of the training corpora. Our results demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of the data while retaining up to $\sim99\%$ of the performance of the fully-trained models.
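To make the idea of selecting a representative subset through submodular optimization concrete, the following is a minimal sketch and not the framework described above. It assumes each training sample has already been mapped to a feature vector (the `features` array is hypothetical), and it assumes the facility-location function $f(S) = \sum_i \max_{j \in S} \mathrm{sim}(i, j)$ as the submodular objective; the abstract does not state which objective or which optimizer is used. The standard greedy algorithm then adds, at each step, the sample with the largest marginal gain.

```python
import numpy as np


def facility_location_greedy(features, budget):
    """Greedily select up to `budget` row indices of `features`.

    Objective (an assumption for illustration, not taken from the abstract):
    f(S) = sum_i max_{j in S} sim(i, j), the facility-location function.
    """
    features = np.asarray(features, dtype=float)

    # Cosine similarity rescaled to [0, 1] so the objective is monotone.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = (unit @ unit.T + 1.0) / 2.0

    n = sim.shape[0]
    best_cover = np.zeros(n)            # similarity of each sample to its closest selected sample
    chosen = np.zeros(n, dtype=bool)    # mask of samples already selected
    selected = []

    for _ in range(min(budget, n)):
        # Marginal gain of candidate j: sum_i max(0, sim[i, j] - best_cover[i]).
        gains = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        gains[chosen] = -np.inf         # never pick the same sample twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best_cover = np.maximum(best_cover, sim[:, j])

    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(200, 16))   # stand-in for real sample embeddings
    subset = facility_location_greedy(embeddings, budget=20)
    print(len(subset), "samples selected out of", len(embeddings))
```

This dense version builds the full pairwise similarity matrix, so memory grows quadratically with the number of samples. At the scale of a pre-training corpus one would presumably need partitioning, sparsified similarities, or lazy gain evaluation, none of which is shown here.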