Interactive imitation learning makes an agent's control policy robust through stepwise supervision from an expert. Recent algorithms mostly employ expert-agent switching systems, which reduce the expert's burden by restricting when supervision is given. However, this approach is useful only for static tasks; in dynamic tasks, timing discrepancies cause abrupt changes in actions, and the robot loses dynamic stability. This paper therefore proposes a novel method, named CubeDAgger, which improves robustness with fewer dynamic stability violations, even for dynamic tasks. The proposed method builds on a baseline, EnsembleDAgger, with three improvements. The first adds a regularization term that explicitly activates the threshold used to decide the supervision timing. The second transforms the expert-agent switching system into an optimal consensus system over multiple action candidates. The third injects autoregressive colored noise into the agent's actions for time-consistent exploration. These improvements are verified in simulations, which show that the trained policies are sufficiently robust while maintaining dynamic stability during interaction. Finally, real-robot scooping experiments with a human expert demonstrate that the proposed method can learn robust policies from scratch with just 30 minutes of interaction. https://youtu.be/kBl3SCTnVEM
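To make the idea of time-consistent exploration concrete, the following is a minimal sketch of autoregressive colored noise, assuming a first-order autoregressive (AR(1)) process. The abstract does not state the exact noise model used in CubeDAgger, so the function name `ar1_noise` and the parameters `beta` (temporal correlation) and `sigma` (stationary standard deviation) are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def ar1_noise(n_steps, action_dim, beta=0.9, sigma=0.1, seed=0):
    """Sample AR(1) noise: eps[t] = beta * eps[t-1] + scaled white noise.

    Hypothetical helper for illustration only. `beta` in [0, 1) sets how
    strongly consecutive samples are correlated; `sigma` is the stationary
    standard deviation of each noise dimension.
    """
    rng = np.random.default_rng(seed)
    noise = np.zeros((n_steps, action_dim))
    # Scale the innovation so the stationary variance stays at sigma**2.
    scale = sigma * np.sqrt(1.0 - beta**2)
    # Start from the stationary distribution.
    noise[0] = sigma * rng.standard_normal(action_dim)
    for t in range(1, n_steps):
        noise[t] = beta * noise[t - 1] + scale * rng.standard_normal(action_dim)
    return noise


# Usage: perturb a sequence of (placeholder) agent actions.
actions = np.zeros((200, 2))
noisy_actions = actions + ar1_noise(n_steps=200, action_dim=2)
```

With `beta = 0` this reduces to white Gaussian noise, which changes direction at every step. Larger values of `beta` keep the perturbation pointing in a similar direction across consecutive steps, which is the property the abstract refers to as time-consistent exploration.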