Neural Machine Translation models are extremely data- and compute-hungry. However, not all data points contribute equally to model training and generalization. Pruning low-value data points can drastically reduce the compute budget without a significant drop in model performance. In this paper, we propose a new data pruning technique, Checkpoints Across Time (CAT), which leverages early model training dynamics to identify the data points most relevant to model performance. We benchmark CAT against several data pruning techniques, including COMET-QE, LASER, and LaBSE. We find that CAT outperforms these benchmarks on Indo-European languages across multiple test sets. When applied to English-German, English-French, and English-Swahili translation tasks, CAT achieves performance comparable to using the full dataset while pruning up to 50% of the training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.