ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models

Alignment is of critical importance for training large language models (LLMs). The predominant strategy is Reinforcement Learning from Human Feedback (RLHF), where PPO serves as the de facto algorithm. Yet PPO is known to suffer from computational inefficiency, a challenge that this paper aims to address. We identify three important properties of RLHF tasks that PPO does not leverage: fast simulation, deterministic transitions, and trajectory-level rewards. Based on these observations, we develop a new algorithm tailored for RLHF, called ReMax. ReMax builds on the celebrated REINFORCE algorithm but is equipped with a new variance-reduction technique. Our method has three advantages over PPO. First, ReMax is simple to implement and removes many of PPO's hyper-parameters, which are scale-sensitive and laborious to tune. Second, ReMax saves about 50% of memory usage in principle, a saving achieved by removing the value model used in PPO. As a result, PPO runs out of memory when fine-tuning a Llama2 (7B) model on 8xA100-40GB GPUs, whereas ReMax can afford the training. Third, based on our calculations, even assuming PPO could afford the training of Llama2 (7B), it would still run about 2x slower than ReMax. This is due to the computational overhead of the value model, which does not exist in ReMax. Importantly, these computational improvements do not sacrifice performance. We hypothesize that these advantages can be maintained in larger-scale models. Our implementation of ReMax is available at https://github.com/liziniu/ReMax
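
The abstract names the ingredients (REINFORCE, a variance-reduction technique, trajectory-level rewards, no value model) without spelling out the update rule. The following is a minimal sketch of a REINFORCE-style gradient with a baseline on a toy problem, not the authors' implementation. It assumes the baseline is the reward of the greedy response, which is how ReMax is commonly described but is not stated in the abstract and should be checked against the paper. The tabular, context-free policy, the token-matching reward, and every function name here are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_response(logits, rng):
    """Sample one token per position (stochastic decoding)."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0] for row in logits]


def greedy_response(logits):
    """Pick the highest-logit token per position (greedy decoding)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def reward(response, target):
    """Trajectory-level reward: one scalar for the whole response (toy choice)."""
    return float(sum(a == b for a, b in zip(response, target)))


def remax_style_gradient(logits, target, rng):
    """REINFORCE gradient estimate with a greedy-response baseline (assumed).

    No value model is involved: the baseline comes from one extra rollout.
    """
    sampled = sample_response(logits, rng)
    baseline = reward(greedy_response(logits), target)
    advantage = reward(sampled, target) - baseline
    grad = []
    for row, token in zip(logits, sampled):
        probs = softmax(row)
        # d log pi(token) / d logit[v] = 1[v == token] - probs[v]
        grad.append([advantage * ((1.0 if v == token else 0.0) - p)
                     for v, p in enumerate(probs)])
    return grad


def train(target, vocab_size=4, steps=2000, lr=0.1, seed=0):
    """Plain stochastic gradient ascent on the toy reward."""
    rng = random.Random(seed)
    logits = [[0.0] * vocab_size for _ in target]
    for _ in range(steps):
        grad = remax_style_gradient(logits, target, rng)
        for row, grad_row in zip(logits, grad):
            for v in range(vocab_size):
                row[v] += lr * grad_row[v]
    return logits
```

Calling `train([2, 0, 3])` and then `greedy_response` on the returned logits shows whether the toy policy has moved toward the target. In this sketch the only per-step cost beyond plain REINFORCE is one additional greedy rollout and one extra reward evaluation, which is cheap only under the fast-simulation property the abstract mentions.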

