We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts three execution-derived signals from source and test code alone: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF, we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release the dataset and methodology for comparative evaluation. We evaluate multiple model families and tuning regimes (zero-shot, full fine-tuning, and parameter-efficient fine-tuning via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to conventional compile-and-run evaluation, RM-RF offers substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
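The three predicted signals can serve as a scalar reward for downstream RL-based test generation. The following is a minimal illustrative sketch of one way to combine them; the class, function, and weights are assumptions for illustration, not the paper's method.

```python
# Hypothetical sketch: collapsing RM-RF's three predicted binary signals
# into a scalar reward. Weights are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class TestSignals:
    compiles: bool        # augmented suite compiles and runs successfully
    coverage_gain: bool   # generated tests increase code coverage
    mutation_gain: bool   # generated tests improve the mutation kill rate


def reward(signals: TestSignals,
           weights: tuple = (0.5, 0.25, 0.25)) -> float:
    """Weighted sum of the three signals. Returns 0.0 when the suite does
    not run, since coverage and mutation gains are undefined without
    successful execution."""
    if not signals.compiles:
        return 0.0
    w_run, w_cov, w_mut = weights
    return w_run + w_cov * signals.coverage_gain + w_mut * signals.mutation_gain


# Example: a candidate that runs and adds coverage but kills no new mutants.
print(reward(TestSignals(True, True, False)))  # 0.75
```

Gating the reward on successful execution mirrors the labeling pipeline's ordering: coverage and mutation signals are only meaningful for suites that compile and run.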