We measure the rate at which code RL environments accept incorrect solutions as correct. On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough that a Docker-verified incorrect patch passes them. On 20 R2E-Gym tasks across 6 repositories, the same pipeline with single-shot exploit generation yields a rate of 25.0%. A random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified finds that, within the same human-rated difficulty stratum, model Pass@1 is 14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6; I^2 = 0%; 123 of 134 models positive). We then describe a procedure for hardening the broken tasks. It pairs an inline LLM judge with a Docker gold-sanity gate that runs each generated test against the gold solution before the judge is consulted. On the 11 broken tasks in the audit, the gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate that the LLM judge alone misses. With diversity-biased retry, the loop converges on a gated upgrade for 9 of the 11 tasks.
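The gold-sanity gate described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `gold_sanity_gate`, `defect_rate`, and the `passes_on_gold` callable are hypothetical, and the callable stands in for whatever Docker run executes a generated test against the gold patch. The sketch only shows the ordering, in which tests that fail on the gold solution are rejected before any judge sees them, and how the per-augmentation defect rate follows from the counts.

```python
from typing import Callable, Iterable, List, Tuple


def gold_sanity_gate(
    tests: Iterable[str],
    passes_on_gold: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split generated tests into (kept, rejected).

    `passes_on_gold` is a hypothetical stand-in for a Docker run of one
    test against the gold patch; it returns True when the test passes.
    """
    kept: List[str] = []
    rejected: List[str] = []
    for test_id in tests:
        if passes_on_gold(test_id):
            kept.append(test_id)      # safe to forward to the LLM judge
        else:
            rejected.append(test_id)  # fails on the gold patch: defective test
    return kept, rejected


def defect_rate(kept: List[str], rejected: List[str]) -> float:
    """Fraction of generated tests that fail on the gold patch."""
    total = len(kept) + len(rejected)
    return len(rejected) / total if total else 0.0


# Stub runner reproducing the counts in the abstract: 65 of 105 tests fail on gold.
all_tests = [f"t{i}" for i in range(105)]
failing = set(all_tests[:65])
kept, rejected = gold_sanity_gate(all_tests, lambda t: t not in failing)
rate_percent = round(100 * defect_rate(kept, rejected), 1)
print(rate_percent)
```

Passing the runner in as a callable keeps the gate logic separate from the container tooling, so the same function can be exercised with a stub, as here, or with a real Docker-backed check.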