Reinforcement learning for code generation relies on verifiable rewards derived from unit-test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model but face an inherent dilemma: white-box access invites self-collusion, where the model produces trivial tests to earn easy rewards, while black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives: the Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates the risk of self-collusion and safely enables white-box test generation, in which the Test LLM inspects candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward that balances test validity against adversarial difficulty. Experiments on Qwen2.5-Coder models show that Code-A1 matches or exceeds the code generation performance of models trained on human-annotated tests while significantly improving test generation capability.
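To make the opposing objectives concrete, below is a minimal sketch of how the two reward signals could be computed. The abstract specifies only the direction of each objective; the function names, the validity gate via a reference solution, and the `alpha`-weighted blend of validity and adversarial difficulty are illustrative assumptions, not the paper's actual reward design.

```python
from typing import Callable, List

Program = Callable[..., object]      # a candidate solution
Test = Callable[[Program], bool]     # returns True if the program passes


def code_reward(candidate: Program, tests: List[Test]) -> float:
    """Code LLM signal: fraction of adversarial tests the candidate passes."""
    if not tests:
        return 0.0
    return sum(t(candidate) for t in tests) / len(tests)


def test_reward(test: Test, reference: Program,
                candidates: List[Program], alpha: float = 0.5) -> float:
    """Test LLM signal: a composite of validity and adversarial difficulty.

    validity   -- the test must accept a known-correct reference solution;
                  an invalid test earns nothing.
    difficulty -- fraction of candidate programs the test fails, i.e. how
                  many defects it exposes.
    The linear blend with `alpha` is a placeholder for the paper's
    (unspecified) composite reward.
    """
    if not test(reference) or not candidates:
        return 0.0  # validity acts as a hard gate
    difficulty = sum(not test(c) for c in candidates) / len(candidates)
    return alpha + (1.0 - alpha) * difficulty
```

Gating on the reference solution is one simple way to block degenerate tests (e.g. tests that fail every program) from collecting difficulty reward.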
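The Mistake Book is described only as an experience-replay mechanism; one plausible reading is a bounded buffer of exposed defects that is sampled back into training batches. The record fields, FIFO eviction, and uniform sampling in this sketch are assumptions for illustration.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Mistake:
    problem: str        # task description / prompt (assumed field)
    buggy_code: str     # candidate program that failed (assumed field)
    failing_test: str   # adversarial test that exposed the defect (assumed field)


class MistakeBook:
    """Toy replay buffer of exposed defects; eviction policy is assumed FIFO."""

    def __init__(self, capacity: int = 10_000):
        self.buffer: deque = deque(maxlen=capacity)

    def record(self, mistake: Mistake) -> None:
        """Store a defect exposed by the Test LLM during co-evolution."""
        self.buffer.append(mistake)

    def replay(self, k: int) -> List[Mistake]:
        """Sample up to k past mistakes to mix into the next training batch."""
        k = min(k, len(self.buffer))
        return random.sample(list(self.buffer), k)
```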