Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every state means that the agent must learn, while also learning to maximize its reward, to ignore irrelevant actions such as $\textit{inapplicable actions}$ (i.e., actions that have no effect on the environment when performed in a given state). Knowing which actions are inapplicable can reduce the sample complexity of RL algorithms: masking them out of the policy distribution restricts exploration to the actions relevant to finding an optimal policy. Although the Automated Planning community formalized this idea long ago through the concept of preconditions in the STRIPS language, RL algorithms have never formally taken advantage of this information to prune the search space. Instead, pruning is typically done in an ad hoc manner, with hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introducing this knowledge into the algorithm. We (i) standardize the way knowledge can be manually specified to the agent; and (ii) present a new framework to autonomously learn the partial action model encapsulating the preconditions of an action jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal to mask out irrelevant actions. Moreover, we demonstrate that the acquired knowledge is transferable: it can be reused in other tasks and domains to make the learning process more efficient.
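To make the masking idea concrete, the following is a minimal sketch of how inapplicable actions might be removed from a policy distribution using hand-specified, STRIPS-style preconditions. It is not the framework proposed in the paper: the toy door-and-key domain, the `PRECONDITIONS` table, and the helpers `applicable_mask` and `masked_softmax` are all hypothetical names introduced for illustration, and the mask here is written by hand rather than learned jointly with the policy.

```python
import math

# Hypothetical toy domain: a state is a set of facts, and each action lists
# the facts that must hold for it to be applicable (STRIPS-style precondition).
PRECONDITIONS = {
    "open_door": {"at_door", "door_closed"},
    "walk_through": {"at_door", "door_open"},
    "pick_key": {"at_key"},
}
ACTIONS = list(PRECONDITIONS)


def applicable_mask(state):
    """Return one boolean per action: True if its precondition holds in `state`."""
    return [PRECONDITIONS[action] <= state for action in ACTIONS]


def masked_softmax(logits, mask):
    """Softmax over `logits`, assigning zero probability to masked-out actions."""
    if not any(mask):
        raise ValueError("at least one action must be applicable")
    # Subtract the largest applicable logit for numerical stability.
    top = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative policy logits (made-up numbers, one per action in ACTIONS).
logits = [0.2, 1.5, -0.3]

# Only "open_door" is applicable when standing at a closed door.
state = {"at_door", "door_closed"}
mask = applicable_mask(state)
probs = masked_softmax(logits, mask)

# With the door open and a key nearby, two actions remain applicable.
state2 = {"at_door", "door_open", "at_key"}
mask2 = applicable_mask(state2)
probs2 = masked_softmax(logits, mask2)
```

In the first state, all probability mass falls on the single applicable action even though the unmasked logits favor `walk_through`, so the agent never spends samples on actions that cannot change the environment. A learned partial action model would take the place of the hand-written `PRECONDITIONS` table as the source of the mask.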