Supervised fine-tuning (SFT) is crucial for aligning large language models (LLMs) with human instructions. The primary goal of data selection for SFT is to choose a small yet representative subset of the training pool, such that fine-tuning on this subset achieves results comparable to, or even exceeding, those obtained with the entire dataset. However, most existing data selection techniques are designed for small-scale pools and fail to meet the demands of real-world SFT scenarios. In this paper, we replicate several self-scoring methods (those that do not rely on external model assistance) on two million-scale datasets, and find that nearly all of them struggle to significantly outperform random selection on data pools of this size. Moreover, our comparisons suggest that, during SFT, diversity in data selection matters more than simply focusing on high-quality data. We also analyze the limitations of several current approaches, explaining why they perform poorly on large-scale datasets and why they are unsuitable for such settings. Finally, we find that filtering data by token length offers a stable and efficient way to improve results. This approach, particularly when training on long-text data, proves highly beneficial for relatively weaker base models, such as Llama3.
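The token-length filter described above can be sketched as a simple top-k selection over the pool. This is a minimal illustration, not the paper's exact pipeline: the example schema, the budget, and the whitespace tokenizer are assumptions; a real run would use the base model's own tokenizer (e.g. Llama3's) to count tokens.

```python
def select_by_token_length(pool, budget, tokenize=str.split):
    """Keep the `budget` examples with the most tokens.

    pool: list of dicts with "instruction" and "response" fields
    (an assumed schema for illustration).
    tokenize: any callable mapping text -> list of tokens; str.split
    is a stand-in for a real subword tokenizer.
    """
    def num_tokens(example):
        # Count tokens over the full training example (prompt + answer).
        return len(tokenize(example["instruction"] + " " + example["response"]))

    # Sort the pool by descending token count and truncate to the budget.
    return sorted(pool, key=num_tokens, reverse=True)[:budget]


# Toy pool: the longest example survives a budget of 1.
pool = [
    {"instruction": "Say hi", "response": "Hi"},
    {"instruction": "Explain SFT", "response": "Supervised fine-tuning adapts a base model to follow instructions."},
    {"instruction": "Add 2+2", "response": "4"},
]
subset = select_by_token_length(pool, budget=1)
```

Because the score is just a length, this filter is a single pass over the pool with no model inference, which is what makes it cheap and stable at million-example scale.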