Reinforcement learning from human feedback (RLHF) has shown potential to improve the performance of foundation models on qualitative tasks. Despite this promise, its efficacy is often limited when it is treated merely as a mechanism for maximizing a learned reward model of averaged human preferences, especially in areas such as image generation that demand diverse model responses. Meanwhile, quality diversity (QD) algorithms, which seek diverse, high-quality solutions, are often constrained by their dependence on manually defined diversity metrics. These limitations of RLHF and QD can be overcome by combining insights from both. This paper introduces Quality Diversity through Human Feedback (QDHF), which uses human feedback to infer diversity metrics, expanding the applicability of QD algorithms. Empirical results show that QDHF outperforms existing QD methods at automatic diversity discovery and matches the search capabilities of QD with human-constructed metrics. Notably, when deployed for a latent space illumination task, QDHF markedly enhances the diversity of images generated by a diffusion model. The study concludes with an in-depth analysis of QDHF's sample efficiency and the quality of its derived diversity metrics, emphasizing its promise for enhancing exploration and diversity in optimization for complex, open-ended tasks.
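The abstract does not spell out how diversity metrics are inferred from human feedback, so the following is only a minimal sketch under stated assumptions. It assumes feedback arrives as similarity triplets ("is A more like B or like C?"), fits a linear projection with a triplet hinge loss, and then uses the projected coordinates as the diversity descriptor of a MAP-Elites-style archive. The simulated rater, the linear model, and every function name here are hypothetical illustrations, not the method or API of the paper.

```python
import numpy as np

def simulated_judgment(a, p, n):
    """Stand-in for a human rater (assumption): True if p looks more similar
    to a than n does. Hidden ground truth: only the first two features matter."""
    return np.sum((a[:2] - p[:2]) ** 2) < np.sum((a[:2] - n[:2]) ** 2)

def make_triplets(rng, num, dim):
    """Sample random items and order each triple as (anchor, positive, negative)."""
    out = []
    for a, b, c in rng.normal(size=(num, 3, dim)):
        out.append((a, b, c) if simulated_judgment(a, b, c) else (a, c, b))
    return np.array(out)  # shape (num, 3, dim)

def fit_projection(triplets, out_dim=2, margin=1.0, lr=0.02, epochs=300, seed=0):
    """Fit a linear map W by gradient descent on a triplet hinge loss:
    max(0, margin + |W(a - p)|^2 - |W(a - n)|^2), averaged over triplets."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(out_dim, triplets.shape[-1]))
    dp = triplets[:, 0] - triplets[:, 1]  # anchor minus positive
    dn = triplets[:, 0] - triplets[:, 2]  # anchor minus negative
    for _ in range(epochs):
        zp, zn = dp @ W.T, dn @ W.T
        # Only triplets that violate the margin contribute to the gradient.
        active = margin + (zp ** 2).sum(1) - (zn ** 2).sum(1) > 0
        grad = 2 * (zp[active].T @ dp[active] - zn[active].T @ dn[active])
        W -= lr * grad / len(triplets)
    return W

def agreement(W, triplets):
    """Fraction of triplets whose ordering the learned metric reproduces."""
    zp = (triplets[:, 0] - triplets[:, 1]) @ W.T
    zn = (triplets[:, 0] - triplets[:, 2]) @ W.T
    return float(np.mean((zp ** 2).sum(1) < (zn ** 2).sum(1)))

def to_cell(W, x, cell_size=1.0):
    """Map a solution's features to a discrete archive cell via the learned descriptor."""
    return tuple(np.floor(W @ x / cell_size).astype(int))

def add_to_archive(archive, W, x, quality):
    """MAP-Elites-style insertion: keep only the best solution seen in each cell."""
    cell = to_cell(W, x)
    if cell not in archive or quality > archive[cell][0]:
        archive[cell] = (quality, x)
        return True
    return False

rng = np.random.default_rng(0)
W = fit_projection(make_triplets(rng, 2000, 8))
print("held-out agreement:", agreement(W, make_triplets(rng, 500, 8)))
```

The point of the design is that the archive never sees a hand-written diversity measure: its cells are defined entirely by the projection learned from judgments. In a real setting the simulated rater would be replaced by collected human responses, and the linear map would likely be replaced by a richer model over learned features.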