Offline imitation learning (IL) refers to learning expert behavior solely from demonstrations, without any additional interaction with the environment. Despite significant advances in offline IL, existing techniques struggle to learn policies for long-horizon tasks and require substantial retraining when task specifications change. To address these limitations, we present GO-DICE, an offline IL technique for goal-conditioned long-horizon sequential tasks. GO-DICE discerns a hierarchy of sub-tasks from demonstrations and uses it to learn two separate policies, one for sub-task transitions and one for action execution; this hierarchical policy learning facilitates long-horizon reasoning. Inspired by the expansive DICE family of techniques, policy learning at both levels takes place in the space of stationary distributions. Further, both policies are learned with goal conditioning to minimize the need for retraining when task goals change. Experimental results show that GO-DICE outperforms recent baselines, with a marked improvement in the completion rate on increasingly challenging pick-and-place MuJoCo robotic tasks. GO-DICE can also leverage imperfect demonstrations and partial task segmentation when available, both of which boost task performance relative to learning from expert demonstrations alone.
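To make the two-level, goal-conditioned structure described in the abstract concrete, the following is a minimal sketch of the policy interface only. It is not the GO-DICE implementation: the 1-D pick-and-place world, the sub-task labels `REACH`/`CARRY`/`DONE`, and the hand-written `high_level` and `low_level` rules are all hypothetical stand-ins for policies that GO-DICE would learn from demonstrations, and the DICE-style stationary-distribution training is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical toy world: state = (gripper_pos, object_pos), goal = target position.
State = Tuple[int, int]
Goal = int

# Hypothetical sub-task labels (assumed here, not taken from the paper).
REACH, CARRY, DONE = 0, 1, 2


@dataclass
class HierarchicalPolicy:
    """Two goal-conditioned policies: one picks the sub-task, one picks the action."""
    high: Callable[[State, Goal, int], int]  # (state, goal, previous sub-task) -> sub-task
    low: Callable[[State, Goal, int], int]   # (state, goal, current sub-task) -> action

    def act(self, state: State, goal: Goal, prev_subtask: int) -> Tuple[int, int]:
        subtask = self.high(state, goal, prev_subtask)
        action = self.low(state, goal, subtask)
        return subtask, action


def high_level(state: State, goal: Goal, prev_subtask: int) -> int:
    """Hand-written stand-in for a learned sub-task transition policy."""
    gripper, obj = state
    if obj == goal:
        return DONE
    if gripper == obj:
        return CARRY
    return REACH


def low_level(state: State, goal: Goal, subtask: int) -> int:
    """Hand-written stand-in for a learned action policy; returns -1, 0, or +1."""
    gripper, obj = state
    if subtask == REACH:
        target = obj
    elif subtask == CARRY:
        target = goal
    else:
        return 0
    return (target > gripper) - (target < gripper)


def step(state: State, action: int, subtask: int) -> State:
    """Toy dynamics: the object moves with the gripper only while being carried."""
    gripper, obj = state
    new_gripper = gripper + action
    new_obj = new_gripper if subtask == CARRY else obj
    return (new_gripper, new_obj)


def rollout(policy: HierarchicalPolicy, state: State, goal: Goal,
            max_steps: int = 50) -> Tuple[State, List[int]]:
    """Run the hierarchical policy and record the sequence of chosen sub-tasks."""
    subtask = REACH
    trace: List[int] = []
    for _ in range(max_steps):
        subtask, action = policy.act(state, goal, subtask)
        trace.append(subtask)
        if subtask == DONE:
            break
        state = step(state, action, subtask)
    return state, trace


policy = HierarchicalPolicy(high=high_level, low=low_level)
final_state, subtasks = rollout(policy, state=(0, 3), goal=5)
```

Because the goal is an input to both levels, the same `policy` object can be rolled out toward a different target, for example `rollout(policy, (0, 3), goal=1)`, without changing either function. This is only a toy analogue of the goal conditioning the abstract credits with reducing retraining.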