Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without the need for costly manual reward design by aligning with human preferences. It is crucial to consider diverse human feedback types and various learning methods across different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow starting from real human feedback, fostering progress on practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF provides a user-friendly annotation interface tailored to various feedback types and compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline for crowdsourced annotation, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks. Extensive experiments show that results obtained with the collected datasets are competitive with those obtained with well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We hope to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at https://uni-rlhf.github.io/.