Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to a reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty of mitigating this gap. Motivated by the Error-Related Negativity (ERN), whereby the reasoner can localize errors following incorrect decisions and thus guide rapid adjustments, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO's impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.