In this study, we investigate an emerging optimization challenge: a black-box objective function that can only be evaluated through a ranking oracle. This situation arises frequently in real-world scenarios, especially when the function is evaluated by human judges. The challenge is inspired by Reinforcement Learning with Human Feedback (RLHF), an approach recently employed to improve the performance of Large Language Models (LLMs) using human guidance. We introduce ZO-RankSGD, a zeroth-order optimization algorithm designed to tackle this problem, accompanied by theoretical guarantees. Our algorithm uses a novel rank-based random estimator to determine the descent direction and is guaranteed to converge to a stationary point. Moreover, ZO-RankSGD is readily applicable to policy optimization problems in Reinforcement Learning (RL), particularly when only ranking oracles for the episode reward are available. Finally, we demonstrate the effectiveness of ZO-RankSGD in a novel application: improving the quality of images generated by a diffusion generative model using human ranking feedback. In our experiments, we found that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback. Overall, our work advances the field of zeroth-order optimization by addressing the problem of optimizing functions with only ranking feedback, and offers a new and effective approach for aligning Artificial Intelligence (AI) with human intentions.
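To make the idea of a rank-based descent direction concrete, the following is a minimal sketch of zeroth-order descent driven only by a ranking oracle. The abstract does not give the exact form of the ZO-RankSGD estimator, so everything here is an assumption for illustration: the pairwise "worse minus better" averaging, the Gaussian perturbations, the constant step size, and the names `rank_based_direction`, `zo_rank_descent`, and `make_oracle` are all hypothetical and should not be read as the authors' algorithm.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
# A ranking oracle takes candidate points and returns their indices ordered
# from best (lowest objective) to worst. It never reveals objective values.
RankingOracle = Callable[[Sequence[Vector]], List[int]]


def rank_based_direction(perturbations: Sequence[Vector],
                         ranking: Sequence[int]) -> Vector:
    """Average (worse - better) over every pair ordered by the ranking.

    Assumed estimator for illustration: the result tends to point toward
    higher objective values, so the caller steps in the opposite direction.
    """
    dim = len(perturbations[0])
    direction = [0.0] * dim
    num_pairs = 0
    for pos_better in range(len(ranking)):
        for pos_worse in range(pos_better + 1, len(ranking)):
            better = perturbations[ranking[pos_better]]
            worse = perturbations[ranking[pos_worse]]
            for k in range(dim):
                direction[k] += worse[k] - better[k]
            num_pairs += 1
    return [d / num_pairs for d in direction]


def zo_rank_descent(x0: Vector,
                    ranking_oracle: RankingOracle,
                    steps: int = 200,
                    num_queries: int = 8,
                    smoothing: float = 0.1,
                    step_size: float = 0.1,
                    seed: int = 0) -> Vector:
    """Iterate: perturb, ask the oracle for a ranking, step against the estimate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        # Random Gaussian search directions around the current point.
        perturbations = [[rng.gauss(0.0, 1.0) for _ in x]
                         for _ in range(num_queries)]
        candidates = [[xi + smoothing * p for xi, p in zip(x, pert)]
                      for pert in perturbations]
        ranking = ranking_oracle(candidates)
        g = rank_based_direction(perturbations, ranking)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


def make_oracle(hidden_objective: Callable[[Vector], float]) -> RankingOracle:
    """Stand-in for a human judge: exposes an ordering, never the values."""
    def oracle(candidates: Sequence[Vector]) -> List[int]:
        return sorted(range(len(candidates)),
                      key=lambda i: hidden_objective(candidates[i]))
    return oracle


if __name__ == "__main__":
    def sphere(v: Vector) -> float:
        return sum(c * c for c in v)

    start = [3.0] * 5
    final = zo_rank_descent(start, make_oracle(sphere))
    print("objective before:", sphere(start), "after:", sphere(final))
```

In this sketch the optimizer never sees a function value. The hidden objective is used only inside the oracle to produce an ordering, which mirrors the setting in which a human judge can rank candidates but cannot score them. Because a ranking carries no magnitude information, the estimated direction has roughly constant length, so a fixed step size settles into a neighborhood of a stationary point rather than converging exactly. A decaying step size would be the usual remedy, and is left out here for brevity.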