As large-scale models proliferate, selecting and optimizing datasets from vast, heterogeneous data so as to improve large language model performance under limited computational resources has become a central challenge. This paper details our solution to the BetterMixture challenge, which focuses on data mixing for fine-tuning large language models. Our approach, which secured third place, combines data deduplication, low-level and high-level quality filtering, and diversity selection. The solution is built on Ke-Data-Juicer, an extension of Data-Juicer, and demonstrates its capability for processing and optimizing data for large language models.
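For orientation, the stages named in the abstract (deduplication, low-level and high-level quality filtering, diversity selection) could be chained roughly as in the sketch below. This is a minimal illustration under stated assumptions, not the Ke-Data-Juicer implementation: every function name, the word-count thresholds, the exact-hash deduplication, the optional `score_fn` standing in for a high-level quality scorer, and the greedy Jaccard-based selection are all hypothetical choices made for this example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(samples: list[str]) -> list[str]:
    """Exact deduplication on normalized text, keeping the first occurrence."""
    seen: set[str] = set()
    kept = []
    for s in samples:
        digest = hashlib.sha256(normalize(s).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(s)
    return kept


def low_level_filter(samples, min_words=3, max_words=512):
    """Keep samples whose word count lies in an assumed acceptable range."""
    return [s for s in samples if min_words <= len(s.split()) <= max_words]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_diverse(samples, k):
    """Greedily pick up to k samples, each least similar to those already chosen."""
    if k <= 0 or not samples:
        return []
    tokens = [set(normalize(s).split()) for s in samples]
    chosen = [0]  # seed with the first sample
    # max_sim[i]: highest similarity between sample i and any chosen sample
    max_sim = [jaccard(t, tokens[0]) for t in tokens]
    while len(chosen) < min(k, len(samples)):
        candidates = [i for i in range(len(samples)) if i not in chosen]
        nxt = min(candidates, key=lambda i: max_sim[i])
        chosen.append(nxt)
        max_sim = [max(m, jaccard(t, tokens[nxt])) for m, t in zip(max_sim, tokens)]
    return [samples[i] for i in chosen]


def build_mixture(samples, k, score_fn=None, threshold=0.5):
    """Run the stages in order; score_fn is a hypothetical high-level quality scorer."""
    samples = deduplicate(samples)
    samples = low_level_filter(samples)
    if score_fn is not None:
        samples = [s for s in samples if score_fn(s) >= threshold]
    return select_diverse(samples, k)


raw = [
    "Explain how gradient descent works.",
    "explain  how gradient descent works.",  # duplicate after normalization
    "Hi",                                     # too short for the low-level filter
    "Explain how gradient descent converges.",
    "Write a poem about the sea in autumn.",
]
mixture = build_mixture(raw, k=2)
```

With this toy input, the near-duplicate and the one-word sample are dropped, and the greedy step prefers the poem prompt over the second gradient-descent prompt because it shares no tokens with the seed sample. A real pipeline would likely use fuzzy deduplication and a model-based quality score in place of these simple stand-ins.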