Deep reinforcement learning (DRL) frameworks are increasingly used to solve high-dimensional continuous control tasks in robotics. However, because of poor sample efficiency, applying DRL to online learning remains practically infeasible in the robotics domain. One reason is that DRL agents do not reuse the solutions of previous tasks when learning new ones. Recent work on multi-task DRL agents based on successor features (SFs) has shown promise in improving sample efficiency. In this work, we present a new approach that unifies two prior multi-task RL frameworks, SF-GPI (successor features with generalized policy improvement) and value composition, and adapts them to the continuous control domain. We exploit the compositional properties of successor features to compose a policy distribution from a set of primitives without training any new policy. Finally, to demonstrate the multi-tasking mechanism, we present two proof-of-concept benchmark environments, Pointmass and Pointer, built on IsaacGym, which supports large-scale parallelization to accelerate experiments. Our experimental results show that the multi-task agent achieves single-task performance on par with soft actor-critic (SAC) and transfers successfully to new, unseen tasks. Our code is available as open source at "https://github.com/robot-perception-group/concurrent_composition" for the benefit of the community.
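To make the reuse of primitives concrete, the following is a minimal sketch of the standard SF-GPI action rule in a discrete-action toy setting. It is not the continuous-control method of this work: the function name `gpi_action`, the array shapes, and all numbers are illustrative assumptions. It assumes rewards are linear in some features, so that each primitive's action value is the dot product of its successor features with a task weight vector `w`.

```python
import numpy as np


def gpi_action(psi, w):
    """Pick an action by generalized policy improvement (GPI).

    psi: array of shape (n_primitives, n_actions, d), the successor
         features of each primitive policy at the current state
         (hypothetical values, assumed already learned).
    w:   array of shape (d,), the weight vector of the new task.
    """
    # Q_i(s, a) = psi_i(s, a) . w for every primitive i and action a.
    q = psi @ w
    # GPI: take the best primitive per action, then the best action.
    best_per_action = q.max(axis=0)
    return int(best_per_action.argmax()), q


# Toy example: two primitives, two actions, two reward features.
psi = np.array([
    [[1.0, 0.0], [0.2, 0.1]],  # primitive 0
    [[0.0, 0.3], [0.1, 1.0]],  # primitive 1
])
w_new = np.array([0.5, 1.0])  # weights of an unseen task (assumed)

action, q = gpi_action(psi, w_new)
print(action)  # index of the action selected for the new task
print(q)       # per-primitive action values under w_new
```

No primitive is retrained here: only the weight vector changes between tasks, which is the property that makes transfer to a new task cheap in this toy setting.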