Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deception and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. To help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. In some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. We caution against blindly applying RLHF in partially observable settings and propose research directions to help tackle these challenges.
翻译:以往关于从人类反馈中强化学习(RLHF)的分析假设人类能够完全观测环境。当人类反馈仅基于部分观测时会发生什么?我们正式定义了两个失败情形:欺骗与过度合理化。通过将人类建模为关于轨迹信念的玻尔兹曼理性主体,我们证明了在何种条件下RLHF必然导致策略欺骗性地夸大其表现、过度合理化其行为以制造印象,或两者兼而有之。为帮助解决这些问题,我们从数学上刻画了环境的部分可观测性如何转化为学习到的回报函数中的(缺乏)模糊性。在某些情形下,考虑部分可观测性使得从理论上恢复回报函数及最优策略成为可能;而在其他情形下,则存在不可约的模糊性。我们警示不应在部分可观测环境中盲目应用RLHF,并提出若干研究方向以应对这些挑战。