When solving math problems, most language models use a sampling strategy to predict the next word according to conditional probabilities. In a math reasoning step, sampling may generate a wrong answer. Because math problems are deterministic, we propose a mixed-policy exploration approach for solving math problems with reinforcement learning. In particular, we propose a two-level token exploration policy: the abstract level explores the next token stochastically, and the second level is deterministic. Specifically, the abstract-level policy decides by probability sampling whether the next token is an operator or an operand, while the second level greedily selects the token with the highest score. We test our method on the GSM8K dataset with the GPT-2 model and demonstrate a performance gain of more than $2\%$. Our implementation is available at https://github.com/vividitytech/math_lm_rl.
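The two-level token selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the linked repository: the function name `two_level_select`, the toy vocabulary, and the choice to obtain the operator-versus-operand probability by summing softmax mass over each group are all hypothetical, since the abstract does not say how the abstract-level distribution is computed.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def two_level_select(logits, operator_ids, rng=random):
    """Pick the next token with a stochastic category choice and a greedy token choice.

    Assumption (not from the paper): the probability of the "operator"
    category is the total softmax mass of the operator tokens.
    """
    operator_ids = set(operator_ids)
    if not operator_ids or len(operator_ids) >= len(logits):
        raise ValueError("need at least one operator and one operand token")

    probs = softmax(logits)
    p_operator = sum(probs[i] for i in operator_ids)

    # Level 1 (abstract, stochastic): sample operator vs. operand.
    pick_operator = rng.random() < p_operator

    # Level 2 (deterministic): greedy argmax restricted to the sampled category.
    candidates = [
        i for i in range(len(logits)) if (i in operator_ids) == pick_operator
    ]
    token = max(candidates, key=lambda i: logits[i])
    return ("operator" if pick_operator else "operand"), token


if __name__ == "__main__":
    # Hypothetical toy vocabulary: three operands followed by three operators.
    vocab = ["1", "2", "3", "+", "-", "*"]
    logits = [2.0, 0.5, 1.0, 1.5, 0.2, 0.1]
    category, token = two_level_select(logits, operator_ids={3, 4, 5})
    print(category, vocab[token])
```

With these toy scores, only the category is random: once it is sampled, the selected token is always the highest-scoring member of that category ("1" for operands, "+" for operators). In a reinforcement learning setup, exploration would then come only from the category choice.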