Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. Even in high-quality samples, patterns or phrases that are unrelated to the task can be redundant or uninformative after pre-training. Continuing to fine-tune on these patterns offers limited benefit and can even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. We analyze the benefits and limitations of both methods theoretically via error upper bounds. Extensive experiments show that our framework consistently improves performance across multiple downstream tasks.
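The pipeline described above — scoring each token by the influence of model updates and then applying a threshold-based separation — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes token influence is approximated by the drop in per-token loss from a fixed reference model to the fine-tuned model, and the function names (`token_influence`, `clean_tokens`) are hypothetical.

```python
import numpy as np

def token_influence(ref_losses, model_losses):
    """Approximate per-token influence as the reduction in loss relative to a
    fixed reference model. A large positive score means fine-tuning made this
    token much easier to predict, i.e. it likely carries task-specific signal.
    (Hypothetical scoring rule for illustration, not the paper's formula.)"""
    return np.asarray(ref_losses) - np.asarray(model_losses)

def clean_tokens(tokens, ref_losses, model_losses, threshold=0.0):
    """Threshold-based separation: keep only tokens whose influence score
    exceeds the threshold; masked tokens are excluded from the SFT loss."""
    scores = token_influence(ref_losses, model_losses)
    mask = scores > threshold
    kept = [tok for tok, keep in zip(tokens, mask) if keep]
    return kept, mask

# Toy example with synthetic per-token losses from the two models.
tokens = ["The", "answer", "is", "42"]
ref_losses = [0.5, 3.0, 0.4, 4.0]    # fixed reference model
model_losses = [0.6, 1.0, 0.5, 1.5]  # model after an SFT update
kept, mask = clean_tokens(tokens, ref_losses, model_losses)
```

In the iterative variant sketched in the abstract, the reference model would itself be updated each round (a self-evolving reference), so `ref_losses` would be recomputed from the previous round's cleaned data rather than held fixed.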