Training next-generation code generation models requires high-quality datasets, yet existing datasets suffer from difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, which incorporates Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi-dimensional difficulty metrics across five weighted dimensions to retain challenging problems while removing simplistic ones. The resulting MicroCoder dataset comprises tens of thousands of curated real competitive programming problems from diverse platforms, emphasizing recency and difficulty. Evaluations on the strictly unseen LiveCodeBench show that MicroCoder achieves 3x larger performance gains within 300 training steps than widely used baseline datasets of comparable size, with consistent advantages under both GRPO and its variant training algorithms. MicroCoder delivers substantial improvements on medium and hard problems across model sizes, where model capabilities are most stretched, achieving up to 17.2% relative gains in overall performance. These results validate that difficulty-aware data curation improves model performance on challenging tasks, offering practical insights for dataset construction in code generation.
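The selection step of the difficulty-filtering framework can be sketched as a weighted combination of per-dimension scores followed by thresholding. This is a minimal illustrative sketch only: the dimension names, weights, and threshold below are hypothetical assumptions, not the framework's actual configuration, and the LLM-based prediction and calibration stages are elided.

```python
# Illustrative sketch of difficulty-weighted problem selection.
# Dimension names, weights, and the threshold are hypothetical;
# the real framework predicts and calibrates these scores with an LLM.

def difficulty_score(scores: dict, weights: dict) -> float:
    """Combine per-dimension difficulty scores (each in [0, 1])
    using weights that sum to 1."""
    return sum(weights[dim] * scores[dim] for dim in weights)

def select_hard(problems: list, weights: dict, threshold: float = 0.5) -> list:
    """Retain problems whose weighted difficulty meets the threshold,
    discarding simplistic ones."""
    return [p for p in problems
            if difficulty_score(p["scores"], weights) >= threshold]

# Hypothetical five weighted dimensions.
weights = {"algorithmic": 0.3, "implementation": 0.2, "math": 0.2,
           "data_structures": 0.15, "edge_cases": 0.15}

problems = [
    {"id": 1, "scores": {"algorithmic": 0.9, "implementation": 0.7, "math": 0.6,
                         "data_structures": 0.8, "edge_cases": 0.5}},
    {"id": 2, "scores": {"algorithmic": 0.2, "implementation": 0.1, "math": 0.1,
                         "data_structures": 0.2, "edge_cases": 0.1}},
]

hard = select_hard(problems, weights, threshold=0.5)
# Only problem 1 survives: its weighted score is 0.725, versus 0.145 for problem 2.
```

In practice the threshold would be tuned against the calibrated score distribution rather than fixed a priori.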