We study Reinforcement Learning from Human Feedback (RLHF), where multiple individuals with diverse preferences provide feedback strategically to sway the final policy in their favor. We show that existing RLHF methods are not strategyproof, which can result in learning a substantially misaligned policy even when only one out of $k$ individuals reports their preferences strategically. In turn, we also find that any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, highlighting an inherent trade-off between incentive alignment and policy alignment. We then propose a pessimistic median algorithm that, under appropriate coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of individuals and samples increases.
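As a rough illustration of the two ingredients the abstract names, pessimism and median aggregation, the sketch below combines lower-confidence-bound reward estimates with a per-action median across individuals in a simplified bandit-style setting. All names and numbers here (`reward_estimates`, `confidence_widths`) are hypothetical placeholders, not the paper's notation; the actual algorithm operates on RLHF preference data and is not reproduced here.

```python
import numpy as np

# Hypothetical per-individual reward estimates for 3 candidate responses,
# reported by k = 5 individuals (all numbers are made up for illustration).
reward_estimates = np.array([
    [0.9,  0.2, 0.5],
    [0.8,  0.3, 0.4],
    [0.7,  0.1, 0.6],
    [0.8,  0.2, 0.5],
    [5.0, -3.0, 9.0],   # one strategic reporter exaggerating wildly
])

# Hypothetical confidence widths, e.g. shrinking with each individual's
# sample size (a wider interval means less data, hence more pessimism).
confidence_widths = np.array([
    [0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2],
    [0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1],
])

# Pessimism: replace each estimate by its lower confidence bound.
lcb = reward_estimates - confidence_widths

# Median aggregation across individuals: a single strategic report can move
# the median only as far as a neighboring honest report, which is the
# intuition behind approximate strategyproofness.
aggregated = np.median(lcb, axis=0)

# A greedy policy picks the response with the best aggregated pessimistic
# reward; the outlier reporter does not change the choice.
print(aggregated)                   # [0.7 0.1 0.4]
print(int(np.argmax(aggregated)))   # 0
```

Replacing the median with a mean in this sketch would let the single strategic row dominate the aggregate, which mirrors the abstract's claim that standard (mean-based) RLHF aggregation is not strategyproof.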