In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function given one or more source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the target Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as "Q-Manipulation" (Q-M). The iterative process assumes access to a lite model, which is easy to provide or learn. We formally prove that, in discrete domains, Q-M does not affect the optimality of the returned policy, and we show that it is provably efficient in sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
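The core pruning principle can be illustrated with a minimal sketch (this is an assumption-laden toy, not the paper's exact construction): given a known toy model of transitions P and a target reward R (standing in for the lite-model role), we run value-iteration-style backups on loose upper and lower bounds of Q* and prune any action whose upper bound falls below the best lower bound at its state, since such an action cannot be optimal.

```python
import numpy as np

# Hypothetical toy MDP (all names and sizes are illustrative assumptions).
rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
R = rng.uniform(-1.0, 1.0, size=(nS, nA))      # target reward, assumed known

# Initialize loose bounds on Q* from the reward range.
r_min, r_max = R.min(), R.max()
L = np.full((nS, nA), r_min / (1 - gamma))     # lower bound on Q*
U = np.full((nS, nA), r_max / (1 - gamma))     # upper bound on Q*

# Value-iteration-style tightening: the Bellman operator is monotone,
# so backups preserve L <= Q* <= U while shrinking the gap.
for _ in range(200):
    L = R + gamma * P @ L.max(axis=1)
    U = R + gamma * P @ U.max(axis=1)

# Prune action a at state s if its upper bound falls below the best
# lower bound at s: such an action cannot be optimal, so it can be
# excluded before learning in the target domain even starts.
keep = U >= L.max(axis=1, keepdims=True)
print(keep)  # boolean mask of actions surviving pruning
```

In this sketch the bounds are initialized from the reward range; in Q-M they would instead be derived from the source Q-functions and the known combination function, which typically yields much tighter initial bounds and hence more aggressive pruning.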