Training next-generation code generation models requires high-quality datasets, yet existing datasets suffer from difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, which incorporates Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi-dimensional difficulty metrics across five weighted dimensions to retain challenging problems while removing simplistic ones. The resulting MicroCoder dataset comprises tens of thousands of curated real competitive programming problems from diverse platforms, emphasizing recency and difficulty. Evaluations on the strictly unseen LiveCodeBench show that MicroCoder achieves 3x larger performance gains within 300 training steps than widely used baseline datasets of comparable size, with consistent advantages under both GRPO and its variant training algorithms. MicroCoder delivers substantial improvements on medium and hard problems across model sizes, where model capabilities are most stretched, achieving up to 17.2% relative gains in overall performance. These results validate that difficulty-aware data curation improves model performance on challenging tasks, offering practical insights for dataset construction in code generation.
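The selection step of the difficulty-filtering framework can be sketched as a weighted combination of per-dimension scores followed by thresholding. This is a minimal illustrative sketch only: the dimension names, weights, and threshold below are hypothetical assumptions, not the framework's actual configuration, and the LLM-based prediction and calibration stages are elided.

```python
# Illustrative sketch of difficulty-weighted problem selection.
# Dimension names, weights, and the threshold are hypothetical;
# the real framework predicts and calibrates these scores with an LLM.

def difficulty_score(scores: dict, weights: dict) -> float:
    """Combine per-dimension difficulty scores (each in [0, 1])
    using weights that sum to 1."""
    return sum(weights[dim] * scores[dim] for dim in weights)

def select_hard(problems: list, weights: dict, threshold: float = 0.5) -> list:
    """Retain problems whose weighted difficulty meets the threshold,
    discarding simplistic ones."""
    return [p for p in problems
            if difficulty_score(p["scores"], weights) >= threshold]

# Hypothetical five weighted dimensions.
weights = {"algorithmic": 0.3, "implementation": 0.2, "math": 0.2,
           "data_structures": 0.15, "edge_cases": 0.15}

problems = [
    {"id": 1, "scores": {"algorithmic": 0.9, "implementation": 0.7, "math": 0.6,
                         "data_structures": 0.8, "edge_cases": 0.5}},
    {"id": 2, "scores": {"algorithmic": 0.2, "implementation": 0.1, "math": 0.1,
                         "data_structures": 0.2, "edge_cases": 0.1}},
]

hard = select_hard(problems, weights, threshold=0.5)
# Only problem 1 survives: its weighted score is 0.725, versus 0.145 for problem 2.
```

In practice the threshold would be tuned against the calibrated score distribution rather than fixed a priori.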