Solving complex manipulation tasks often requires a policy to learn a diverse set of skills. This set of skills is often highly multimodal: each skill may have a distinct distribution of actions and states. Standard deep policy-learning algorithms typically model a policy as a deep neural network with a single output head, either deterministic or stochastic. This structure requires the network to learn to switch between modes internally, which can lead to lower sample efficiency and poor performance. In this paper we explore a simple structure that is conducive to the skill learning required by many manipulation tasks. Specifically, we propose a policy architecture that sequentially executes different action heads for fixed durations, enabling the learning of primitive skills such as reaching and grasping. Our empirical evaluation on the Metaworld tasks shows that this simple structure outperforms standard policy-learning methods, highlighting its potential for improved skill acquisition.
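To make the idea of sequentially executed action heads concrete, the following is a minimal sketch and not the authors' implementation. It assumes a shared feature trunk, a fixed number of heads, and a schedule in which head k is active for `head_duration` consecutive timesteps; the abstract does not say what happens once the schedule is exhausted, so the sketch simply keeps the last head active. The class name `SequentialHeadPolicy`, all parameter names, the layer sizes, and the random untrained weights are hypothetical choices made for illustration.

```python
import numpy as np


class SequentialHeadPolicy:
    """Shared trunk with several action heads executed in a fixed order.

    Illustrative sketch only: weights are random and untrained, and the
    schedule (head k active for `head_duration` steps) is an assumption.
    """

    def __init__(self, obs_dim, act_dim, num_heads=2, head_duration=50,
                 hidden_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.num_heads = num_heads
        self.head_duration = head_duration
        # Shared feature extractor used by every head.
        self.w_trunk = rng.normal(0.0, 0.1, size=(obs_dim, hidden_dim))
        self.b_trunk = np.zeros(hidden_dim)
        # One independent linear output layer per head.
        self.w_heads = rng.normal(0.0, 0.1,
                                  size=(num_heads, hidden_dim, act_dim))
        self.b_heads = np.zeros((num_heads, act_dim))

    def active_head(self, t):
        """Index of the head that acts at episode timestep t."""
        # Assumption: the last head stays active after the schedule ends.
        return min(t // self.head_duration, self.num_heads - 1)

    def act(self, obs, t):
        """Deterministic action from the head scheduled for timestep t."""
        k = self.active_head(t)
        features = np.tanh(obs @ self.w_trunk + self.b_trunk)
        # tanh keeps each action component within [-1, 1].
        return np.tanh(features @ self.w_heads[k] + self.b_heads[k])
```

With `num_heads=2`, for example, the first head could specialize in reaching and the second in grasping, since each head only ever sees the timesteps of its own phase. Because the switch is driven by the timestep rather than learned, the network does not have to discover internally when to change modes.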