Reinforcement Learning from Human Feedback (RLHF) has shown potential in qualitative tasks where clear objectives are lacking. However, its potential is not fully realized when it is treated merely as a tool to optimize average human preferences, especially in generative tasks that demand diverse model responses. Meanwhile, Quality Diversity (QD) algorithms excel at identifying diverse and high-quality solutions but often rely on manually crafted diversity metrics. This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach that integrates human feedback into the QD framework. QDHF infers diversity metrics from human judgments of similarity among solutions, thereby enhancing the applicability and effectiveness of QD algorithms. Our empirical studies show that QDHF significantly outperforms state-of-the-art methods in automatic diversity discovery and matches the efficacy of QD with manually crafted metrics on standard benchmarks in robotics and reinforcement learning. Notably, in a latent space illumination task, QDHF substantially enhances the diversity of images generated by a diffusion model and is received more favorably in user studies. We conclude by analyzing QDHF's scalability and the quality of its derived diversity metrics, emphasizing its potential to improve exploration and diversity in complex, open-ended optimization tasks. Source code is available on GitHub: https://github.com/ld-ing/qdhf.
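To make the idea of inferring a diversity metric from similarity judgments concrete, the following is a minimal sketch and not the authors' implementation. It assumes that judgments arrive as triplets (a, p, n) meaning "a is more similar to p than to n", that the metric is a linear projection trained with a hinge-style triplet loss, and that human answers are simulated from a hidden two-dimensional ground truth. All names (`fit_projection`, `simulate_judgments`, `demo`) and all hyperparameters are hypothetical.

```python
import numpy as np


def sq_dist(W, x, y):
    """Squared Euclidean distance between x and y after projecting with W."""
    diff = (x - y) @ W.T
    return np.sum(diff * diff, axis=-1)


def triplet_accuracy(W, X, triplets):
    """Fraction of triplets (a, p, n) where a lands closer to p than to n."""
    a, p, n = X[triplets[:, 0]], X[triplets[:, 1]], X[triplets[:, 2]]
    return float(np.mean(sq_dist(W, a, p) < sq_dist(W, a, n)))


def fit_projection(X, triplets, k=2, margin=1.0, lr=0.05, steps=500, seed=0):
    """Learn a k-dimensional linear projection W from similarity triplets.

    Minimizes the mean hinge loss max(0, margin + d(a, p) - d(a, n)) by
    full-batch gradient descent. Assumed setup, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((k, X.shape[1]))
    a, p, n = X[triplets[:, 0]], X[triplets[:, 1]], X[triplets[:, 2]]
    u, v = a - p, a - n
    for _ in range(steps):
        # Only triplets that violate the margin contribute to the gradient.
        active = (margin + sq_dist(W, a, p) - sq_dist(W, a, n)) > 0
        ua, va = u[active], v[active]
        # d/dW ||W u||^2 = 2 W u u^T, summed over active triplets.
        S = (ua.T @ ua - va.T @ va) / len(triplets)
        W -= lr * 2.0 * (W @ S)
    return W


def simulate_judgments(X, hidden_dims, num, rng):
    """Stand-in for human raters: similarity is judged on hidden_dims only."""
    idx = rng.integers(0, len(X), size=(num, 3))
    H = X[:, hidden_dims]
    d1 = np.sum((H[idx[:, 0]] - H[idx[:, 1]]) ** 2, axis=1)
    d2 = np.sum((H[idx[:, 0]] - H[idx[:, 2]]) ** 2, axis=1)
    swap = d1 > d2  # reorder so that column 1 is always the closer item
    pos = np.where(swap, idx[:, 2], idx[:, 1])
    neg = np.where(swap, idx[:, 1], idx[:, 2])
    return np.stack([idx[:, 0], pos, neg], axis=1)


def demo(seed=0):
    """Return held-out triplet accuracy before and after training."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((200, 8))  # hypothetical solution features
    train = simulate_judgments(X, [0, 1], 2000, rng)
    test = simulate_judgments(X, [0, 1], 500, rng)
    W_init = fit_projection(X, train, steps=0, seed=seed)
    W_fit = fit_projection(X, train, seed=seed)
    return triplet_accuracy(W_init, X, test), triplet_accuracy(W_fit, X, test)


if __name__ == "__main__":
    before, after = demo()
    print(f"held-out triplet accuracy: before={before:.2f}, after={after:.2f}")
```

In a QD loop, the learned projection `X @ W.T` would take the place of a hand-designed measure function: each solution's projected coordinates decide which archive cell it competes in. A nonlinear encoder could replace `W` under the same loss.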