Influence Scores at Scale for Efficient Language Data Sampling

Modern ML systems ingest data aggregated from diverse sources, such as synthetic data, human annotations, and live customer traffic. Understanding which examples are important to the performance of a learning algorithm is crucial for efficient model training. Recently, a growing body of literature has given rise to various "influence scores," which use training artifacts such as model confidence or checkpointed gradients to identify important subsets of data. However, these methods have primarily been developed in computer vision settings, and it remains unclear how well they generalize to language-based tasks using pretrained models. In this paper, we explore the applicability of influence scores in language classification tasks. We evaluate a diverse subset of these scores on the SNLI dataset by measuring how accuracy changes when training data is pruned through random sampling and through influence-score-based sampling. We then stress-test one of the scores, "variance of gradients" (VoG) from Agarwal et al. (2022), in an NLU model stack that was exposed to dynamic user speech patterns in a voice-assistant setting. Our experiments demonstrate that in many cases, encoder-based language models can be fine-tuned on roughly 50% of the original data without degradation in performance metrics. Along the way, we summarize lessons learned from applying out-of-the-box implementations of influence scores, quantify the effects of noisy and class-imbalanced data, and offer recommendations on score-based sampling for better accuracy and training efficiency.
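
To make the comparison of random and score-based pruning concrete, the following is a minimal sketch and not the paper's implementation. The function names `vog_style_score` and `prune`, the `strategy` options, and the choice of keeping the highest- or lowest-scoring examples are all assumptions made for illustration. The score is a simplified stand-in for VoG: it takes the variance of an example's gradient across checkpoints in each dimension and averages over dimensions, omitting any per-class normalization.

```python
import random
from statistics import pvariance


def vog_style_score(grads_per_checkpoint):
    """Simplified VoG-style score for one example (illustrative only).

    grads_per_checkpoint: list of K gradient vectors, one per checkpoint,
    each a list of floats of equal length.
    """
    # Regroup values so that each tuple holds one dimension across checkpoints.
    per_dimension = zip(*grads_per_checkpoint)
    variances = [pvariance(values) for values in per_dimension]
    # Average the per-dimension variances into a single scalar score.
    return sum(variances) / len(variances)


def prune(scores, keep_fraction, strategy="top", seed=0):
    """Return the sorted indices of the examples to keep.

    strategy (hypothetical options):
      "random" - uniform sampling baseline
      "top"    - keep the highest-scoring examples
      "bottom" - keep the lowest-scoring examples
    """
    n_keep = max(1, round(keep_fraction * len(scores)))
    indices = list(range(len(scores)))
    if strategy == "random":
        kept = random.Random(seed).sample(indices, n_keep)
    elif strategy == "top":
        kept = sorted(indices, key=lambda i: scores[i], reverse=True)[:n_keep]
    elif strategy == "bottom":
        kept = sorted(indices, key=lambda i: scores[i])[:n_keep]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(kept)


if __name__ == "__main__":
    # Toy scores for four training examples; keep roughly half of them.
    scores = [0.1, 0.9, 0.5, 0.3]
    for strategy in ("random", "top", "bottom"):
        print(strategy, prune(scores, keep_fraction=0.5, strategy=strategy))
```

In an experiment of the kind described above, the model would be fine-tuned once per retained subset and the resulting accuracies compared against training on the full data; which end of the score distribution is worth keeping is an empirical question that this sketch leaves open.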
