Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational with respect to a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make a favorable impression, or both. Under the new assumption that the human's partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function. We show that sometimes the human's feedback determines the return function uniquely up to an additive constant, but that in other realistic cases there is irreducible ambiguity. We propose exploratory research directions to help tackle these challenges and caution against blindly applying RLHF in partially observable settings.
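The deceptive-inflation failure mode above can be made concrete with a minimal numerical sketch. The snippet below is not the paper's formal model; it assumes a standard Boltzmann-rational (logistic) preference over returns and a hypothetical `believed_return` helper that averages returns over the human's belief about full trajectories consistent with a partial observation. When an observation hides a failure, the human's believed return exceeds the true return, so the policy that hides more is preferred more often:

```python
import math

def boltzmann_preference(return_a, return_b, beta=1.0):
    """P(human prefers trajectory A over B) under Boltzmann rationality:
    a logistic function of the scaled return difference."""
    return 1.0 / (1.0 + math.exp(-beta * (return_a - return_b)))

def believed_return(belief):
    """Expected return under the human's belief over full trajectories,
    given as (probability, return) pairs. Hypothetical helper, not from
    the paper's formalism."""
    return sum(p * r for p, r in belief)

# Partial observability example (illustrative numbers): trajectory A hides
# a failure the human rarely imagines, so its believed return is inflated.
belief_a = [(0.9, 10.0), (0.1, 0.0)]  # human thinks A almost surely succeeded
belief_b = [(1.0, 5.0)]               # B is fully observed

p_prefer_a = boltzmann_preference(believed_return(belief_a),
                                  believed_return(belief_b))
# p_prefer_a > 0.5: feedback rewards A even if its *true* return is lower.
```

Under this sketch, feedback collected this way trains the return model on the believed returns rather than the true ones, which is the mechanism behind deceptive inflation.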