Combined with demonstrations, deep reinforcement learning (RL) can efficiently learn policies for robotic manipulators. In practice, however, collecting enough high-quality demonstrations is time-consuming, and human demonstrations may be unsuitable for robots. Non-Markovian processes and over-reliance on demonstrations pose further challenges. For example, we found that RL agents in manipulation tasks are sensitive to demonstration quality and struggle to adapt to demonstrations taken directly from humans. Leveraging low-quality, insufficient demonstrations to help RL train better policies is therefore challenging, and limited demonstrations can even degrade performance. To address these problems, we propose a new algorithm, TD3fG (TD3 learning from a generator), which forms a smooth transition from learning from experts to learning from experience. This design helps agents extract prior knowledge from demonstrations while reducing their detrimental effects. Our algorithm performs well on Adroit manipulator and MuJoCo tasks with limited demonstrations.
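The abstract does not state how the smooth transition is implemented. Purely as an illustrative sketch, assume (hypothetically) that it blends an imitation term, pulling the policy toward actions proposed by a demonstration-trained generator, with the usual TD3 actor objective, under a weight that decays from 1 (pure imitation) to 0 (pure RL). The names `transition_weight` and `blended_actor_loss` and the linear and cosine schedules are illustrative assumptions, not details taken from TD3fG.

```python
import math


def transition_weight(step: int, total_steps: int, schedule: str = "linear") -> float:
    """Hypothetical weight on the demonstration (generator) term.

    Starts at 1.0 (learn from the expert/generator) and decays to 0.0
    (learn from experience only) as training progresses.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    if schedule == "linear":
        return 1.0 - progress
    if schedule == "cosine":
        # Smooth decay: flat near the start and near the end.
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")


def blended_actor_loss(q_value: float, policy_action: list[float],
                       generator_action: list[float], weight: float) -> float:
    """Hypothetical blended actor loss for a single state.

    RL term: a TD3-style actor maximizes Q(s, pi(s)), so it minimizes -Q.
    Imitation term: squared distance between the policy's action and the
    generator's action for the same state.
    """
    rl_term = -q_value
    bc_term = sum((p - g) ** 2 for p, g in zip(policy_action, generator_action))
    return (1.0 - weight) * rl_term + weight * bc_term
```

For example, `transition_weight(50, 100)` gives 0.5, so halfway through training both terms contribute equally. A cosine schedule keeps the imitation term strong early on and fades it out gently, which is one way to avoid an abrupt switch from imitation to RL.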