Reinforcement Learning from Human Feedback (RLHF) has shown potential in qualitative tasks where easily defined performance measures are lacking. However, RLHF is commonly used to optimize for average human preferences, which has drawbacks, especially in generative tasks that demand diverse model responses. Meanwhile, Quality Diversity (QD) algorithms excel at identifying diverse and high-quality solutions but often rely on manually crafted diversity metrics. This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach that progressively infers diversity metrics from human judgments of similarity among solutions, thereby enhancing the applicability and effectiveness of QD algorithms in complex and open-ended domains. Empirical studies show that QDHF significantly outperforms state-of-the-art methods in automatic diversity discovery and matches the efficacy of QD with manually crafted diversity metrics on standard benchmarks in robotics and reinforcement learning. Notably, in open-ended generative tasks, QDHF substantially enhances the diversity of text-to-image generation from a diffusion model and is more favorably received in user studies. We conclude by analyzing QDHF's scalability, robustness, and the quality of its derived diversity metrics, emphasizing its strength in open-ended optimization tasks. Code and tutorials are available at https://liding.info/qdhf.
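To make the core mechanism concrete, below is a minimal sketch of how a diversity metric could be inferred from human similarity judgments and then used to index a QD archive. It assumes a contrastive triplet formulation and a simple grid-based MAP-Elites archive; the names (`LatentDescriptor`, `train_metric`, `map_elites_step`) are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical sketch: learn a low-dimensional projection whose coordinates
# act as diversity metrics, fit from human triplet judgments of the form
# "anchor is more similar to positive than to negative".

class LatentDescriptor(nn.Module):
    def __init__(self, feat_dim: int, latent_dim: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, latent_dim)

    def forward(self, x):
        return self.proj(x)

def train_metric(model, anchors, positives, negatives, epochs=100, lr=1e-2):
    """Fit the projection with a standard triplet margin loss.

    Each row of (anchors, positives, negatives) encodes one human judgment
    that the anchor solution is more similar to the positive than the negative.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.TripletMarginLoss(margin=1.0)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(anchors), model(positives), model(negatives))
        loss.backward()
        opt.step()
    return model

def map_elites_step(archive, model, candidate_feat, fitness,
                    bins=10, lo=-3.0, hi=3.0):
    """Insert a candidate into a grid archive keyed by its latent descriptor."""
    with torch.no_grad():
        z = model(candidate_feat).numpy()  # learned diversity coordinates
    cell = tuple(np.clip(((z - lo) / (hi - lo) * bins).astype(int),
                         0, bins - 1))
    # Keep only the highest-fitness solution per niche (MAP-Elites rule).
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, candidate_feat)
    return archive

# Toy usage: random features stand in for solution embeddings and judgments.
feats = torch.randn(32, 8)
model = train_metric(LatentDescriptor(feat_dim=8),
                     feats[:10], feats[10:20], feats[20:30])
archive = map_elites_step({}, model, feats[0], fitness=1.0)
```

In QDHF proper, this metric learning is interleaved with QD optimization: as the archive fills, new similarity judgments are collected and the latent projection is progressively refined, so the diversity metric and the solution set improve together.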