In recent times, training Language Models (LMs) has relied on computationally heavy passes over massive datasets, making the training process extremely laborious. In this paper we propose a novel, model-agnostic method for numerically evaluating text quality in large unlabelled NLP datasets, assigning each text instance a "quality score". This text quality metric establishes a framework to identify and eliminate low-quality text instances, leading to improved training efficiency for LMs. Experimental results across multiple models and datasets demonstrate the efficacy of this approach, showing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training. For example, on the OpenWebText dataset we observe an absolute accuracy improvement of 0.9%, averaged over 14 downstream evaluation tasks and multiple LM models, while using 40% less data and training 42% faster; on the Wikipedia dataset we observe an average absolute accuracy improvement of 0.8% while using 20% less data and training 21% faster.
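To make the filtering framework concrete, the following is a minimal sketch of the score-and-filter pipeline the abstract describes. The function name `score_text` and the threshold-based selection are illustrative assumptions; the actual quality metric is the paper's contribution and is not reproduced here.

```python
# Minimal sketch of the quality-score filtering pipeline, assuming a
# hypothetical score_text() implementing the paper's metric (not shown)
# and a corpus given as an iterable of raw text instances.

from typing import Callable, Iterable, List


def filter_corpus(
    corpus: Iterable[str],
    score_text: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep only text instances whose quality score meets the threshold."""
    return [text for text in corpus if score_text(text) >= threshold]


if __name__ == "__main__":
    # Toy length-based stand-in for the real metric, for demonstration only.
    toy_score = lambda t: min(len(t.split()) / 20.0, 1.0)
    docs = ["short", "a reasonably long and coherent sentence " * 3]
    kept = filter_corpus(docs, toy_score, threshold=0.5)
    print(f"kept {len(kept)} of {len(docs)} documents")
```

In practice, the threshold (or equivalently, the fraction of data retained, e.g. the 40% and 20% reductions reported above) would be chosen per dataset; the scoring function itself remains model-agnostic.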