Common self-improvement approaches for large language models (LLMs), such as STaR (Zelikman et al., 2022), iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large number of incorrect solutions generated during this process, potentially neglecting valuable information in them. To address this shortcoming, we propose V-STaR, which uses both the correct and incorrect solutions generated during the self-improvement process to train a verifier with Direct Preference Optimization (DPO) that judges the correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidates. Running V-STaR for multiple iterations yields progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
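The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `VStarData`, `collect_iteration`, and `select_best` are hypothetical names, the `sample`, `is_correct`, and `score` callables stand in for the generator, the ground-truth check (for example, unit tests or a final-answer match), and the trained verifier, and pairing every correct solution with every incorrect one is an assumption about how preference pairs for DPO might be formed.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, List, Tuple


@dataclass
class VStarData:
    """Training data accumulated across iterations (hypothetical container)."""
    # (problem, correct solution) pairs for fine-tuning the generator
    generator_sft: List[Tuple[str, str]] = field(default_factory=list)
    # (problem, chosen, rejected) triples for training the verifier with DPO
    verifier_pairs: List[Tuple[str, str, str]] = field(default_factory=list)


def collect_iteration(
    problems: List[str],
    sample: Callable[[str], str],            # assumed: draws one solution from the generator
    is_correct: Callable[[str, str], bool],  # assumed: checks a solution against ground truth
    k: int,
    data: VStarData,
) -> VStarData:
    """Sample k solutions per problem and keep both correct and incorrect ones."""
    for problem in problems:
        labeled = [(s, is_correct(problem, s)) for s in (sample(problem) for _ in range(k))]
        correct = [s for s, ok in labeled if ok]
        incorrect = [s for s, ok in labeled if not ok]
        # The generator only sees correct solutions, as in STaR-style fine-tuning.
        data.generator_sft.extend((problem, s) for s in correct)
        # Assumption: every (correct, incorrect) combination becomes a preference pair.
        data.verifier_pairs.extend(
            (problem, good, bad) for good, bad in product(correct, incorrect)
        )
    return data


def select_best(
    problem: str,
    candidates: List[str],
    score: Callable[[str, str], float],  # assumed: verifier score, higher is better
) -> str:
    """Inference-time selection: return the candidate the verifier ranks highest."""
    return max(candidates, key=lambda s: score(problem, s))
```

In this sketch, one iteration would call `collect_iteration` with the current generator, fine-tune the generator on `generator_sft` and the verifier on `verifier_pairs` using whatever training code is available, and then repeat with the updated generator. At test time, `select_best` picks one answer from the sampled candidates.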