Modern code generation models exhibit longer outputs, faster capability growth, and shifting training dynamics, rendering traditional training methodologies, algorithms, and datasets increasingly ineffective at improving their performance. To address these training bottlenecks, we propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking, which preserves the potential of long outputs while maintaining training stability; diversity-determined temperature selection, which maintains and encourages output diversity; and removal of the KL loss combined with high clipping ratios, which promotes solution diversity. MicroCoder-GRPO achieves up to a 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended-context evaluation. We also release MicroCoder-Dataset, a more challenging training corpus that yields 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust evaluation framework with approximately 25% higher evaluation accuracy and roughly 40% faster execution. Through a comprehensive analysis spanning more than thirty controlled experiments, we distill 34 training insights across seven main aspects, demonstrating that properly trained models can achieve performance competitive with larger counterparts.
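To make the three algorithmic changes concrete, the sketch below shows one plausible form of a GRPO-style policy loss that combines them: group-relative advantages, a conditional mask over truncated responses, an asymmetric clipping range with a high upper ratio, and no KL penalty term. The function names, the exact masking condition, the diversity criterion, and all hyperparameter values are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged sketch of a GRPO-style objective with the abstract's three modifications.
# All specifics (masking condition, clip values, diversity threshold) are assumed.
import torch

def grpo_loss(
    logp,        # (B, T) per-token log-probs under the current policy
    logp_old,    # (B, T) per-token log-probs under the rollout policy
    rewards,     # (B,)   scalar reward per sampled response
    token_mask,  # (B, T) 1 for real response tokens, 0 for padding
    truncated,   # (B,)   1 if the response hit the length limit
    eps_low=0.2,
    eps_high=0.5,  # high upper clip ratio; no KL term is added (assumption)
):
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (B,)
    adv = adv.unsqueeze(-1)                                      # (B, 1)

    # Conditional truncation masking (assumed form): drop loss contributions
    # from truncated responses unless they still earned a positive advantage,
    # so unfinished-but-promising long outputs keep their gradient signal.
    keep = (~truncated.bool()) | (adv.squeeze(-1) > 0)           # (B,)
    mask = token_mask * keep.unsqueeze(-1).float()               # (B, T)

    # Clipped policy-gradient objective with an asymmetric, high upper clip.
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    per_token = -torch.minimum(unclipped, clipped)

    # Token-level average over kept tokens; note the absence of any KL penalty.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

def pick_temperature(candidate_temps, rollout_fn, diversity_fn, target=0.6):
    """Diversity-determined temperature selection (assumed form): return the
    lowest candidate temperature whose rollouts keep a diversity score
    (e.g., distinct-n) above `target`; `rollout_fn` and `diversity_fn` are
    hypothetical helpers supplied by the training loop."""
    for t in sorted(candidate_temps):
        if diversity_fn(rollout_fn(temperature=t)) >= target:
            return t
    return max(candidate_temps)
```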