Reinforcement Learning (RL) has demonstrated promising results in learning policies for complex tasks, but it often suffers from low sample efficiency and limited transferability. Hierarchical RL (HRL) methods aim to address the difficulty of learning long-horizon tasks by decomposing policies into skills, abstracting states, and reusing skills in new tasks. However, many HRL methods require some initial task success to discover useful skills, and such success is, paradoxically, very unlikely without access to useful skills. Reward-free HRL methods, on the other hand, often need to learn far too many skills to achieve proper coverage in high-dimensional domains. In contrast, we introduce the Chain of Interaction Skills (COInS) algorithm, which focuses on controllability in factored domains to identify a small number of task-agnostic skills that still permit a high degree of control. COInS uses learned detectors to identify interactions between state factors and then trains a chain of skills to control each of these factors successively. We evaluate COInS on a robotic pushing task with obstacles, a challenging domain where other RL and HRL methods fall short. We also demonstrate the transferability of skills learned by COInS using variants of Breakout, a common RL benchmark, and show a 2-3x improvement in both sample efficiency and final performance compared to standard RL baselines.