Past analyses of reinforcement learning from human feedback (RLHF) assume that human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational with respect to a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. Under the new assumption that the human's partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function. We show that sometimes the human's feedback determines the return function uniquely up to an additive constant, but that in other realistic cases there is irreducible ambiguity. We propose exploratory research directions to help tackle these challenges, experimentally validate both the theoretical concerns and potential mitigations, and caution against blindly applying RLHF in partially observable settings.
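To make the modeling assumption concrete, here is a minimal sketch of a Boltzmann-rational comparison under partial observability; the notation (trajectories $\xi$, observation function $o$, human belief $B$ over trajectories, return $G$, rationality parameter $\beta$) is ours and stands in for the paper's formal definitions rather than reproducing them:

\[
P(\xi_1 \succ \xi_2) \;=\; \frac{\exp\!\bigl(\beta\,\mathbb{E}_{B(\cdot\,\mid\, o(\xi_1))}[G]\bigr)}{\exp\!\bigl(\beta\,\mathbb{E}_{B(\cdot\,\mid\, o(\xi_1))}[G]\bigr) \;+\; \exp\!\bigl(\beta\,\mathbb{E}_{B(\cdot\,\mid\, o(\xi_2))}[G]\bigr)}
\]

Under full observability, $B(\cdot \mid o(\xi))$ collapses onto $\xi$ itself and this reduces to the standard Boltzmann model over true returns; the gap between the believed return $\mathbb{E}_{B}[G]$ and the true return $G(\xi)$ is what the deceptive-inflation and overjustification failure modes exploit.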