Language models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for a wide range of NLP tasks. Once the pre-training corpus has been assembled, every sample in it is treated as equally important during LM pre-training. However, because samples vary in relevance and quality, weighting them all equally may not be optimal. While data reweighting has been explored for task-specific supervised learning and LM fine-tuning, model-driven reweighting of pre-training data has not. We fill this important gap and propose PRESENCE, a method that jointly reweights samples and pre-trains the model, using self-influence (SI) scores as an indicator of sample importance. PRESENCE promotes novelty and stability during model pre-training. Through extensive analysis spanning multiple model sizes, datasets, and tasks, we present PRESENCE as an important first step in the research direction of sample reweighting for pre-training language models.