Reinforcement Learning with Human Feedback (RLHF) has received significant attention because it can perform tasks without costly manual reward design by aligning with human preferences. It is crucial to consider diverse human feedback types and various learning methods across different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow starting from real human feedback, fostering progress on practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF provides a user-friendly annotation interface tailored to various feedback types and compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline for crowdsourced annotation, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks. Extensive experiments show that learning from the collected datasets achieves performance competitive with that obtained from well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas for improvement. We aim to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at https://uni-rlhf.github.io/.