A recent trend applies multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion, without using any human data. The typical workflow first runs self-play (SP) repeatedly to build a policy pool and then trains the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized with respect to the environment reward function, which implicitly assumes that the adaptive policy's test-time partners will optimize precisely the same reward function. However, human objectives are often substantially biased by individual preferences and can differ greatly from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the reward space with linear functions, HSP adopts an effective technique to generate an augmented policy pool that includes biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that HSP achieves higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated as the most assistive policy based on human feedback.
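To make the idea of a linear hidden reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hidden utility can be written as a weighted sum of event counts, and it samples one weight vector per biased partner. The event names, the discrete weight choices, and the helper names (`sample_hidden_weights`, `hidden_utility`, `build_biased_pool`) are all hypothetical and chosen only for illustration; the abstract does not specify them.

```python
import random

# Hypothetical event features for an Overcooked-like task.
# These names are illustrative assumptions, not taken from the paper.
EVENTS = ["pickup_onion", "place_in_pot", "pickup_dish", "deliver_soup"]


def sample_hidden_weights(rng, choices=(-1.0, 0.0, 1.0)):
    """Draw one weight per event. The discrete choice set is an assumption."""
    return {event: rng.choice(choices) for event in EVENTS}


def hidden_utility(weights, event_counts):
    """Linear hidden reward: weighted sum of how often each event occurred."""
    return sum(weights[event] * event_counts.get(event, 0) for event in EVENTS)


def build_biased_pool(n, seed=0):
    """Return n sampled weight vectors, each standing in for one biased partner."""
    rng = random.Random(seed)
    return [sample_hidden_weights(rng) for _ in range(n)]
```

In a full pipeline, each sampled weight vector would define the reward for training one biased partner policy, and the adaptive policy would then be trained against the resulting pool. That training loop is omitted here because the abstract does not describe it.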