A large action space is a fundamental obstacle to deploying reinforcement learning methods in the real world. Numerous redundant actions cause agents to make repeated or invalid attempts, and can even lead to task failure. Although existing algorithms have begun to explore this issue, they either rely on rule-based systems or depend on expert demonstrations, which significantly limits their applicability in many real-world settings. In this work, we theoretically analyze which actions can be eliminated in policy optimization and propose a novel redundant action filtering mechanism. Unlike prior work, our method constructs a similarity factor by estimating the distance between state distributions, which requires no prior knowledge. In addition, we incorporate a modified inverse model to avoid extensive computation in high-dimensional state spaces. Building on these techniques, we reveal the underlying structure of action spaces and propose a simple yet efficient redundant action filtering mechanism named No Prior Mask (NPM). We demonstrate the superior performance of our method through extensive experiments on high-dimensional, pixel-input, and stochastic problems with varying degrees of action redundancy. Our code is publicly available at https://github.com/zhongdy15/npm.
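To make the idea of a distribution-based similarity factor concrete, the following is a minimal tabular sketch and not the NPM implementation from the repository. It assumes that the "state distributions" are per-action next-state distributions at a fixed state, that they are known exactly, and that the distance is a KL divergence. The abstract specifies none of these, and the method itself uses a modified inverse model to avoid this kind of direct computation. The names `kl_divergence`, `redundant_action_mask`, and `threshold` are hypothetical.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same next states.

    `eps` only guards against log(0); it is an illustrative choice.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def redundant_action_mask(
    next_state_dists: List[Sequence[float]], threshold: float = 1e-3
) -> List[bool]:
    """Return keep[a] for each action a at one fixed state.

    Assumption: action a is treated as redundant when its next-state
    distribution lies within `threshold` (in KL) of an earlier kept action.
    """
    keep = [True] * len(next_state_dists)
    for a in range(len(next_state_dists)):
        for b in range(a):
            if keep[b] and kl_divergence(next_state_dists[a], next_state_dists[b]) <= threshold:
                keep[a] = False  # a duplicates the effect of b, so mask it
                break
    return keep


# Toy example: four actions, three possible next states.
dists = [
    [1.0, 0.0, 0.0],  # action 0
    [0.0, 1.0, 0.0],  # action 1
    [1.0, 0.0, 0.0],  # action 2: same effect as action 0
    [0.0, 0.5, 0.5],  # action 3
]
print(redundant_action_mask(dists))  # [True, True, False, True]
```

In this toy setting only action 2 is masked, since it induces exactly the same next-state distribution as action 0. In high-dimensional or pixel-input problems the distributions are not available in closed form, which is the situation the inverse-model component described above is meant to address.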