Reinforcement Learning from Human Feedback (RLHF) has played a crucial role in the success of large models such as ChatGPT. RLHF is a reinforcement learning framework that incorporates human feedback to improve learning effectiveness and performance. However, collecting preference feedback manually is expensive in commercial applications. Statistical business metrics are often more valuable, yet RLHF typically ignores them, which leaves a gap between business objectives and model training. In this work, we attempt to close this gap by replacing human feedback with statistical business feedback obtained through AB testing, a well-established statistical method. We propose Reinforcement Learning from Statistical Feedback (RLSF), which is based on AB testing. Statistical inference methods are used to derive preferences for training the reward network, which is then used to fine-tune the pre-trained model within a reinforcement learning framework, achieving greater business value. Furthermore, we extend AB testing, which compares two options at a single time point, to ANT testing, which compares multiple options at different feedback time points. Finally, we design numerical experiments to validate the effectiveness of our algorithmic framework.
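To make the idea of deriving a preference from AB-test statistics concrete, the following is a minimal sketch, not the method of this work. It assumes the business metric is a conversion rate and that the inference step is a two-sided two-proportion z-test; the abstract does not specify either. The function name `ab_preference` and the parameter `alpha` are hypothetical and introduced only for illustration.

```python
import math


def ab_preference(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Derive a preference label from AB-test counts (illustrative assumption).

    conv_a, conv_b: number of conversions observed for variants A and B.
    n_a, n_b: number of users exposed to variants A and B.
    Returns (label, p_value), where label is "A", "B", or None when the
    difference is not statistically significant at level alpha.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b

    # Pooled conversion rate under the null hypothesis p_a == p_b.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # No variation in the data, so no preference can be inferred.
        return None, 1.0

    z = (p_a - p_b) / se
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))

    if p_value >= alpha:
        return None, p_value
    return ("A" if z > 0 else "B"), p_value


if __name__ == "__main__":
    # Hypothetical counts: variant A converts 120/1000, variant B 90/1000.
    label, p_value = ab_preference(120, 1000, 90, 1000)
    print(label, p_value)
```

Under these assumptions, pairs with a significant label could serve as preference data for a reward network in place of human annotations, while pairs labeled `None` would be discarded or kept running until more feedback accumulates.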