For green AI, it is crucial to measure and reduce the carbon footprint of training large language models. In NLP, pre-training Transformer models requires significant computational resources. Pre-training uses a large amount of text data to acquire prior knowledge for downstream tasks. It is therefore important to select domain-specific data from this vast corpus to achieve the best results on domain-specific tasks. Although training on large unlabeled corpora is expensive, the cost can be reduced by performing a data selection step before pre-training. Selecting important data reduces the storage overhead and the substantial time required to pre-train the model while maintaining accuracy. We investigate existing selection strategies and propose our own domain-adaptive data selection method, TextGram, which effectively selects essential data from large corpora. We compare and evaluate fine-tuned models on a text classification task with and without data selection. We show that the proposed strategy outperforms other selection methods.
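To make the idea of a data selection step before pre-training concrete, the sketch below ranks documents from a large corpus by how many of their word n-grams also occur in a small in-domain sample, and keeps the top k. This is a generic n-gram overlap heuristic written for illustration only; it is not the TextGram method, whose details are not given in the abstract. The function names (`ngrams`, `overlap_score`, `select_top_k`), the whitespace tokenization, and the default `n=2` are all assumptions.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(candidate: str, domain_ngrams: Set[Tuple[str, ...]], n: int = 2) -> float:
    """Fraction of the candidate's n-grams that also appear in the domain sample."""
    cand = ngrams(candidate, n)
    if not cand:
        # Texts shorter than n tokens have no n-grams; score them as 0.
        return 0.0
    return len(cand & domain_ngrams) / len(cand)


def select_top_k(corpus: List[str], domain_texts: List[str], k: int, n: int = 2) -> List[str]:
    """Keep the k corpus documents whose n-grams overlap most with the domain sample.

    Illustrative heuristic only (an assumption, not the paper's method).
    """
    domain_ngrams: Set[Tuple[str, ...]] = set()
    for text in domain_texts:
        domain_ngrams |= ngrams(text, n)
    ranked = sorted(
        corpus,
        key=lambda doc: overlap_score(doc, domain_ngrams, n),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    domain = ["the patient was given a dose of insulin"]
    corpus = [
        "the stock market fell sharply today",
        "the patient was given a new treatment",
        "football fans cheered loudly",
    ]
    print(select_top_k(corpus, domain, k=1))
```

In a real pipeline, the selected subset would then be passed to pre-training in place of the full corpus, which is where the savings in time and storage would come from.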