We study reinforcement learning in the presence of an unknown reward perturbation. Existing methods for this problem make strong assumptions, including reward smoothness, known perturbations, and/or perturbations that do not modify the optimal policy. We study the case of unknown, arbitrary perturbations that discretize and shuffle the reward space but have the property that the true reward belongs to the most frequently observed class after perturbation. This class of perturbations generalizes existing classes (and, in the limit, all continuous bounded perturbations) and defeats existing methods. We introduce an adaptive distributional reward critic and show theoretically that it can recover the true rewards under technical conditions. Under the targeted perturbation in discrete and continuous control tasks, we achieve or tie the highest return in 40/57 settings (compared to 16/57 for the best baseline). Even under the untargeted perturbation, we retain an edge over the baseline designed specifically for that setting.
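To make the mode-preserving property concrete, the following is a minimal sketch, not the adaptive distributional reward critic itself. It assumes a single fixed state-action pair with a scalar reward in [0, 1], a uniform discretization into bins, and a hypothetical perturbation that keeps the true bin with probability `p_keep` and otherwise returns one of the other bins uniformly at random. All function names and parameter values are illustrative assumptions. Under these assumptions the empirical mode of the observed classes recovers the true reward bin, while the empirical mean is pulled toward the center of the reward range.

```python
import random
from collections import Counter


def discretize(reward, r_min, r_max, n_bins):
    """Map a scalar reward in [r_min, r_max] to a bin index in [0, n_bins)."""
    width = (r_max - r_min) / n_bins
    idx = int((reward - r_min) / width)
    return min(max(idx, 0), n_bins - 1)  # clamp so r_max falls in the last bin


def bin_center(idx, r_min, r_max, n_bins):
    """Return the reward value at the center of bin idx."""
    width = (r_max - r_min) / n_bins
    return r_min + (idx + 0.5) * width


def perturb(true_bin, n_bins, p_keep, rng):
    """Hypothetical perturbation: keep the true bin with probability p_keep,
    otherwise return one of the other bins uniformly at random.
    The true bin stays the most likely one when p_keep > (1 - p_keep) / (n_bins - 1)."""
    if rng.random() < p_keep:
        return true_bin
    others = [b for b in range(n_bins) if b != true_bin]
    return rng.choice(others)


def recover_by_mode(observed_bins):
    """Return the most frequently observed bin index."""
    return Counter(observed_bins).most_common(1)[0][0]


def simulate(true_reward, n_samples=2000, n_bins=5, p_keep=0.4, seed=0):
    """Compare mode-based and mean-based reward estimates under the perturbation."""
    rng = random.Random(seed)
    r_min, r_max = 0.0, 1.0
    true_bin = discretize(true_reward, r_min, r_max, n_bins)
    observed = [perturb(true_bin, n_bins, p_keep, rng) for _ in range(n_samples)]
    mode_estimate = bin_center(recover_by_mode(observed), r_min, r_max, n_bins)
    mean_estimate = sum(bin_center(b, r_min, r_max, n_bins) for b in observed) / n_samples
    return mode_estimate, mean_estimate


if __name__ == "__main__":
    mode_estimate, mean_estimate = simulate(true_reward=0.9)
    print("mode-based estimate:", mode_estimate)
    print("mean-based estimate:", mean_estimate)
```

In this toy setting the observed reward is wrong 60% of the time, yet the true bin remains the single most frequent class, which is the only structure the mode-based estimate relies on. A learned critic would additionally have to generalize across state-action pairs and choose the discretization, which this sketch does not attempt.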