PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav

We study ObjectGoal Navigation (ObjectNav), where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) using behavior cloning (BC) on a dataset of human demonstrations achieves promising results. However, this approach has two limitations: 1) BC policies generalize poorly to new states, since training mimics actions, not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present PIRLNav, a two-stage learning scheme consisting of BC pretraining on human demonstrations followed by RL-finetuning. This leads to a policy that achieves a success rate of $65.0\%$ on ObjectNav ($+5.0\%$ absolute over the previous state-of-the-art). Using this BC$\rightarrow$RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with `free' (automatically generated) sources of demonstrations, e.g., shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that BC$\rightarrow$RL on human demonstrations outperforms BC$\rightarrow$RL on SP and FE trajectories, even when controlling for the same BC-pretraining success on the train split, and even on a subset of val episodes where BC-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the BC-pretraining dataset. We find that as we increase the size of the BC-pretraining dataset and reach high BC accuracies, the improvements from RL-finetuning become smaller, and that $90\%$ of the performance of our best BC$\rightarrow$RL policy can be achieved with less than half the number of BC demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them.
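To make the BC$\rightarrow$RL recipe concrete, the following is a minimal toy sketch, not the PIRLNav implementation. Everything in it is an assumption made for illustration: a five-cell corridor stands in for the ObjectNav environment, the policy is a table of logits instead of a neural network, the demonstration is a single hand-written path, and plain REINFORCE with a sparse success reward stands in for whatever RL algorithm the paper actually uses (the abstract does not name one). All function and variable names are hypothetical.

```python
import math
import random

N_STATES = 5            # toy corridor with cells 0..4 (assumption, not ObjectNav)
GOAL = N_STATES - 1     # the "goal object" sits in the last cell
LEFT, RIGHT = 0, 1
MAX_STEPS = 20          # episode step budget


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Move one cell left or right, clamped to the corridor."""
    if action == LEFT:
        return max(0, state - 1)
    return min(GOAL, state + 1)


def rollout(theta, rng):
    """Sample one episode from cell 0; return (state, action) pairs and 0/1 success."""
    state, trajectory = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = RIGHT if rng.random() < probs[RIGHT] else LEFT
        trajectory.append((state, action))
        state = step(state, action)
        if state == GOAL:
            return trajectory, 1.0
    return trajectory, 0.0


def log_prob_ascent(theta, state, action, scale):
    """Add scale * grad log pi(action | state) to the logits of `state`."""
    probs = softmax(theta[state])
    for a in (LEFT, RIGHT):
        indicator = 1.0 if a == action else 0.0
        theta[state][a] += scale * (indicator - probs[a])


def bc_pretrain(theta, demos, epochs=5, lr=0.5):
    """Stage 1: behavior cloning, i.e. maximise log-likelihood of demo actions."""
    for _ in range(epochs):
        for state, action in demos:
            log_prob_ascent(theta, state, action, lr)


def rl_finetune(theta, rng, episodes=300, lr=0.1):
    """Stage 2: REINFORCE on a sparse success reward, starting from the BC policy."""
    for _ in range(episodes):
        trajectory, reward = rollout(theta, rng)
        for state, action in trajectory:
            log_prob_ascent(theta, state, action, lr * reward)


def success_rate(theta, rng, episodes=500):
    """Fraction of sampled episodes that reach the goal."""
    return sum(rollout(theta, rng)[1] for _ in range(episodes)) / episodes


rng = random.Random(0)
theta = [[0.0, 0.0] for _ in range(N_STATES)]   # uniform policy at initialisation
demos = [(s, RIGHT) for s in range(GOAL)]       # one hand-written demonstration

bc_pretrain(theta, demos)
sr_bc = success_rate(theta, rng)
rl_finetune(theta, rng)
sr_rl = success_rate(theta, rng)
print(f"success after BC: {sr_bc:.2f}, after BC->RL: {sr_rl:.2f}")
```

The point of the sketch is the structure, not the numbers: RL starts from the BC-initialised parameters rather than from scratch, and the only reward signal is task success. Because the second stage is stochastic, the printed success rates depend on the seed and should not be read as evidence for any claim in the abstract.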
