We study the theoretical aspects of Reinforced Language Models (RLMs) from a bi-objective optimization perspective. Specifically, we consider RLMs as a Pareto optimization problem that simultaneously maximizes two conflicting objectives: the reward objective and the likelihood objective. Our contribution has three parts. First, we establish the theoretical foundations of RLMs as a Pareto optimization problem by presenting the Reward Upper BOund (RUBO) and Pareto optimality. These theoretical results are supported by both deductive proofs and empirical results. Second, we propose Reward Dropout, a simple yet powerful method that is guaranteed to improve the bi-objective optimization of RLMs. Lastly, we demonstrate that Reward Dropout is consistently effective across five benchmark datasets and four benchmark LLMs, significantly improving the optimization performance of RLMs.