We introduce Direct Value Optimization (DVO), a reinforcement learning framework for enhancing large language models on complex reasoning tasks. Unlike traditional methods that rely on preference labels, DVO utilizes value signals at individual reasoning steps, optimizing the model via a mean squared error loss against target values. The key benefit of DVO lies in its fine-grained supervision, which circumvents the need for labor-intensive human annotations. Target values in DVO are estimated with either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior methodology in scenarios lacking explicit human preference information.
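To make the training objective concrete, the sketch below shows one way the step-level mean-squared-error loss described above could look in PyTorch. This is a minimal sketch, not the paper's implementation: the function name `dvo_loss`, the tensor shapes, and the step mask are illustrative assumptions, since the abstract does not specify how per-step values are parameterized or how variable-length reasoning traces are batched.

```python
import torch

def dvo_loss(predicted_values: torch.Tensor,
             target_values: torch.Tensor,
             step_mask: torch.Tensor) -> torch.Tensor:
    """Step-level MSE between model value estimates and target values.

    predicted_values: (batch, num_steps) values the model assigns to each
        reasoning step (parameterization left unspecified by the abstract).
    target_values:    (batch, num_steps) targets estimated via MCTS rollouts
        or an outcome value model.
    step_mask:        (batch, num_steps) 1.0 for real reasoning steps,
        0.0 for padding, so padded steps do not contribute to the loss.
    """
    sq_err = (predicted_values - target_values) ** 2
    return (sq_err * step_mask).sum() / step_mask.sum().clamp(min=1.0)

# Toy usage: two trajectories with up to three reasoning steps each;
# the second trajectory has only two valid steps.
pred = torch.tensor([[0.2, 0.5, 0.9], [0.1, 0.4, 0.0]])
tgt  = torch.tensor([[0.3, 0.6, 1.0], [0.0, 0.5, 0.0]])
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
loss = dvo_loss(pred, tgt, mask)  # scalar MSE averaged over valid steps
```

Because the targets come from search or an outcome value model rather than pairwise preference labels, each reasoning step receives its own regression signal, which is the fine-grained supervision the abstract refers to.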