Large language models (LLMs) have shown remarkable effectiveness across various domains, and data augmentation methods that use GPT for synthetic data generation have become prevalent. However, the quality and utility of augmented data remain questionable, and current methods lack clear metrics for evaluating data characteristics. To address these challenges, we propose ResoFilter, a novel method that integrates models, data, and tasks to refine datasets. ResoFilter leverages the fine-tuning process to obtain Data-Parameter features for data selection, offering improved interpretability by representing data characteristics through model weights. Our experiments demonstrate that ResoFilter achieves results comparable to full-data fine-tuning on mathematical tasks while using only half the data, and exhibits strong generalization across different models and domains. This method provides valuable insights for constructing synthetic datasets and evaluating high-quality data, offering a promising solution for enhancing data augmentation techniques and improving training dataset quality for LLMs. For reproducibility, we will release our code and data upon acceptance.