In the Large Language Model (LLM) reasoning scenario, people often estimate state values via Monte Carlo sampling. Although Monte Carlo estimation is an elegant method with little inductive bias, noise and errors are inevitably introduced due to the limited number of samples. To address this problem, we inject a structural prior into the value representation and recast the scalar value as the expectation of a pre-defined categorical distribution, representing the noise and errors from a distributional perspective. Specifically, by treating the result of Monte Carlo sampling as a single sample from a prior ground-truth binomial distribution, we quantify the sampling error as the mismatch between the posterior estimated distribution and the ground-truth distribution, which we then reduce via distribution selection optimization. We test the performance of value-based process verifiers on Best-of-N and beam search tasks. Compared with the scalar value representation, we show that reasonable structural prior injection, induced by different objective functions or optimization methods, improves the performance of value-based process verifiers by about 1$\sim$2 points at little-to-no cost. We also show that under different structural priors, the verifiers' performances vary greatly despite sharing the same optimal solution, underscoring the importance of reasonable structural prior injection.
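The central move above, recasting a scalar Monte Carlo value estimate as the expectation of a categorical distribution over a pre-defined support, can be sketched as follows. The uniform bin layout on $[0,1]$ and the two-hot projection are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def two_hot_projection(v, atoms):
    """Project scalar v onto a categorical distribution over `atoms`
    whose expectation equals v (a standard "two-hot" encoding).
    Assumes `atoms` is sorted in ascending order."""
    probs = np.zeros(len(atoms))
    idx = np.searchsorted(atoms, v)
    if idx == 0:
        probs[0] = 1.0          # v at or below the lowest atom
    elif idx >= len(atoms):
        probs[-1] = 1.0         # v at or above the highest atom
    else:
        lo, hi = atoms[idx - 1], atoms[idx]
        w = (v - lo) / (hi - lo)
        probs[idx - 1], probs[idx] = 1.0 - w, w  # split mass between neighbors
    return probs

# Pre-defined categorical support on [0, 1] (assumed layout).
atoms = np.linspace(0.0, 1.0, 11)

# Monte Carlo rollouts: 3 successes out of 8 samples -> scalar estimate 0.375,
# viewed as a single draw from the ground-truth binomial distribution.
k, n = 3, 8
v_mc = k / n

p = two_hot_projection(v_mc, atoms)
assert abs(p @ atoms - v_mc) < 1e-9  # expectation recovers the scalar value
```

A verifier trained on such categorical targets can then represent uncertainty over the value rather than a single point estimate, which is where the distributional optimization described above operates.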