Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of data points, which makes fine-tuning on the full corpus expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) it matches the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, it selects the training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by the cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21 BLEU) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves a higher 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus 75.6 seconds for TAROT, a 2.8× speedup.
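The two-stage selection can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions the abstract does not settle: clusters are fit on training source embeddings with a plain Lloyd's k-means, validation sources are assigned to the nearest centroid, the per-cluster budget is rounded by largest remainder, and the "conditional expected distance" is approximated by the mean Euclidean distance from a candidate's target embedding to the validation target embeddings in the same source cluster. All function names (`kmeans`, `assign`, `allocate_budget`, `select`) are hypothetical.

```python
import numpy as np

def assign(x, centroids):
    """Index of the nearest centroid (squared Euclidean) for each row of x."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed stand-in); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(x, centroids)

def allocate_budget(val_labels, k, budget):
    """Stage (i): split the budget in proportion to validation cluster mass."""
    share = np.bincount(val_labels, minlength=k) / len(val_labels)
    raw = share * budget
    alloc = np.floor(raw).astype(int)
    leftover = int(budget - alloc.sum())
    # Largest-remainder rounding so the allocations sum to the budget.
    order = np.argsort(-(raw - alloc))
    alloc[order[:leftover]] += 1
    return alloc

def select(train_src, train_tgt, val_src, val_tgt, k, budget, seed=0):
    """Return indices of selected training pairs (may be fewer than budget
    if a cluster holds fewer candidates than its allocation)."""
    centroids, train_labels = kmeans(train_src, k, seed=seed)
    val_labels = assign(val_src, centroids)
    alloc = allocate_budget(val_labels, k, budget)
    chosen = []
    for j in range(k):
        cand = np.flatnonzero(train_labels == j)
        ref = val_tgt[val_labels == j]
        if alloc[j] == 0 or len(cand) == 0 or len(ref) == 0:
            continue
        # Stage (ii), assumed form: mean distance from each candidate target
        # embedding to the validation target embeddings of this cluster.
        diff = train_tgt[cand][:, None, :] - ref[None, :, :]
        dist = np.linalg.norm(diff, axis=2).mean(axis=1)
        chosen.extend(cand[np.argsort(dist)[: alloc[j]]].tolist())
    return chosen
```

The embeddings passed in can come from any vectorizer (for example TF-IDF or a neural encoder), which is what makes the procedure vectorization-agnostic. At corpus scale the dense pairwise distance in stage (ii) would need batching or an approximation; the sketch keeps it exact for readability.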