In reinforcement learning, abstraction methods that remove unnecessary information from the observation are commonly used to learn policies that generalize better to unseen tasks. However, these methods often overlook a crucial weakness: the function that extracts the reduced-information representation has unknown generalization ability on unseen observations. In this paper, we address this problem by presenting an information removal method that generalizes more reliably to new states. We accomplish this by using a learned masking function that operates on, and is integrated with, the attention weights within an attention-based policy network. We demonstrate that our method significantly improves policy generalization to unseen tasks in the Procgen benchmark compared to standard PPO and masking approaches.
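The core mechanism described above can be illustrated with a minimal sketch: a learned gate (here a simple sigmoid over per-key logits) multiplies the attention weights, suppressing keys carrying task-irrelevant information, and the gated weights are renormalized before attending to the values. This is an illustrative assumption of one way such a mask could be integrated, using NumPy with hypothetical names (`masked_attention`, `mask_logits`); it is not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, mask_logits):
    """Scaled dot-product attention whose weights are gated by a learned mask.

    Q: (n_q, d) queries; K: (n_k, d) keys; V: (n_k, d_v) values;
    mask_logits: (n_k,) learned per-key logits (hypothetical parameterization).
    """
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))               # standard attention weights, (n_q, n_k)
    gate = 1.0 / (1.0 + np.exp(-mask_logits))          # sigmoid mask in [0, 1], one gate per key
    gated = attn * gate                                # down-weight masked-out keys
    gated = gated / gated.sum(axis=-1, keepdims=True)  # renormalize so rows sum to 1
    return gated @ V, gated                            # attended values and gated weights

# Usage: random queries/keys/values; the gated weights remain a valid distribution.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((2, 4)), rng.standard_normal((3, 4)), rng.standard_normal((3, 5))
out, weights = masked_attention(Q, K, V, mask_logits=rng.standard_normal(3))
```

Because the gate is learned jointly with the policy, information removal happens inside the attention computation rather than in a separate preprocessing step, which is the integration the abstract refers to.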