Recent reinforcement learning (RL) algorithms aim to improve the performance of language models at scale. Yet there is no cost-effective, standardized testbed for evaluating and comparing these algorithms. To bridge this gap, we present a generalized version of the 24-Puzzle: the $(N,K)$-Puzzle, which challenges language models to reach a target value $K$ by combining $N$ integers with arithmetic operations. We evaluate established RL algorithms such as Proximal Policy Optimization (PPO), alongside newer approaches such as Identity Preference Optimization (IPO) and Direct Preference Optimization (DPO).
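To make the task concrete, the following is a minimal sketch of a brute-force solver for one puzzle instance. It assumes the classic 24-Puzzle rules, which the text above does not spell out: each of the $N$ integers is used exactly once, and the allowed operations are addition, subtraction, multiplication, and division. The names `solve` and `_search` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations


def solve(numbers, target):
    """Return one expression string that evaluates to target, or None.

    Assumed rules (not stated in the text): every number is used exactly
    once, and only +, -, *, / are allowed.
    """
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: a single value remains; check it against the target.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None

    # Pick any two values, combine them, and recurse on the smaller list.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        # Exact rational arithmetic avoids floating-point false negatives.
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))

        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None
```

For example, `solve([2, 3], 6)` returns `"(2 * 3)"`, and `solve([1, 1, 1, 1], 24)` returns `None` because no combination of four ones reaches 24. A checker of this kind could also serve as a reward signal when scoring model outputs, although that use is an assumption here and not something the text confirms.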