Reinforcement learning shows significant potential for automatically building control policies in many domains, but it is sample-inefficient when applied to robot manipulation tasks because of the curse of dimensionality. Prior knowledge or heuristics that encode an inherent simplification of the task can substantially improve learning performance. This paper defines the natural symmetry present in physical robotic environments and incorporates it into learning. Sample-efficient policies are trained by exploiting expert demonstrations in symmetrical environments through a combination of reinforcement learning and behavior cloning, which gives the off-policy learning process a diverse yet compact initialization. The paper also presents a rigorous framework for this recent concept and explores its scope for robot manipulation tasks. The proposed method is validated in simulation on two point-to-point reaching tasks of an industrial arm, one with and one without an obstacle. Demonstrations are generated by a PID controller that tracks linear joint-space trajectories, with hard-coded temporal logic producing interim midpoints. The results show the effect of the number of demonstrations and quantify the magnitude of the behavior cloning component, illustrating the improvement possible for model-free reinforcement learning in common manipulation tasks. A comparison with a traditional off-policy reinforcement learning algorithm indicates the method's advantage in learning performance and its potential value for applications.
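The idea of reusing demonstrations across symmetrical environments, and of mixing a behavior cloning term into an off-policy objective, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the workspace is taken to be mirror-symmetric about the plane y = 0, states, goals, and actions are Cartesian 3-vectors, and the names `mirror_demo`, `augment`, `actor_loss`, and `bc_weight` are hypothetical.

```python
import numpy as np

# Assumed symmetry: reflection about the plane y = 0 (negates the y component).
REFLECT_Y = np.diag([1.0, -1.0, 1.0])

def mirror_demo(states, goals, actions, reflect=REFLECT_Y):
    """Return the mirrored copy of one demonstration.

    states, goals, actions: arrays of shape (T, 3) holding Cartesian
    end-effector positions, goal positions, and displacement actions
    (a hypothetical representation chosen for this sketch).
    """
    return states @ reflect.T, goals @ reflect.T, actions @ reflect.T

def augment(demos):
    """Append the mirrored twin of every demonstration to the list."""
    return demos + [mirror_demo(*d) for d in demos]

def actor_loss(q_values, policy_actions, demo_actions, bc_weight):
    """Generic actor objective: maximize Q, plus a weighted squared-error
    behavior cloning penalty toward the demonstrated actions."""
    rl_term = -np.mean(q_values)
    bc_term = np.mean(np.sum((policy_actions - demo_actions) ** 2, axis=1))
    return rl_term + bc_weight * bc_term
```

Under these assumptions, each recorded demonstration yields a second, mirrored one at no extra simulation cost, and `bc_weight` plays the role of a knob for how strongly the policy is pulled toward the demonstrated actions.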