Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Code verification has also recently found great success as a critical component in improving the reasoning capabilities of LLMs via reinforcement learning. In this paper, we propose an approach that transforms existing coding benchmarks into scoring and ranking datasets for evaluating the effectiveness of synthetic verifiers. We also propose multiple metrics that, together with the proposed benchmarks, measure different aspects of synthetic verifiers. Applying this approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.
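To make the test-case-based verification setting concrete, the following is a minimal sketch, not the paper's implementation: all names (`pass_rate`, `rank_candidates`, the toy candidates and tests) are hypothetical illustrations of how synthetically generated test cases can score and rank candidate solutions by pass rate.

```python
# Minimal sketch of synthetic test-case verification: score each candidate
# solution by the fraction of generated tests it passes, then rank candidates.
from typing import Callable, List, Tuple

def pass_rate(candidate: Callable[[int], int],
              tests: List[Tuple[int, int]]) -> float:
    """Fraction of (input, expected_output) test cases the candidate passes."""
    passed = 0
    for x, expected in tests:
        try:
            if candidate(x) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed / len(tests) if tests else 0.0

def rank_candidates(candidates: List[Callable[[int], int]],
                    tests: List[Tuple[int, int]]) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) pairs, best-scoring first."""
    scores = [(i, pass_rate(c, tests)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

if __name__ == "__main__":
    # Two hypothetical candidate implementations of "square a number".
    good = lambda x: x * x
    buggy = lambda x: x + x
    # Synthetic test cases, as an LLM verifier might generate them.
    tests = [(2, 4), (3, 9), (0, 0), (-1, 1)]
    print(rank_candidates([good, buggy], tests))  # good candidate ranks first
```

In this sketch, growing the `tests` list mirrors the abstract's finding that scaling the number of generated test cases improves the verifier's ability to separate correct from buggy candidates.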