Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task, generalist policy by training broadly across them. While these generalist policies can improve average performance across many tasks, their performance on any one task is often worse than that of task-specific specialist policies because of negative transfer between partitions of the data. In this work, we argue for the paradigm of training policies during deployment, given the scenarios they encounter: rather than deploying pre-trained policies to unseen problems in a zero-shot manner, we non-parametrically retrieve relevant data at test time and train models directly on it. Furthermore, we show that many robotics tasks share a considerable amount of low-level behavior and that retrieval at the sub-trajectory granularity significantly improves data utilization, generalization, and robustness when adapting policies to novel problems. In contrast, existing full-trajectory retrieval methods tend to underutilize the data and miss shared cross-task content. This work proposes STRAP, a technique that leverages pre-trained vision foundation models and dynamic time warping to robustly retrieve sub-sequences of trajectories from large training corpora. STRAP outperforms both prior retrieval algorithms and multi-task learning methods in simulated and real experiments, scaling to much larger offline datasets in the real world and learning robust control policies from just a handful of real-world demonstrations.
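To make the sub-trajectory retrieval idea concrete, here is a minimal sketch of subsequence dynamic time warping over per-frame embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `subsequence_dtw` is hypothetical, the embeddings are assumed to come from some pre-trained vision model (random arrays stand in for them here), and Euclidean distance is an assumed choice of frame-to-frame cost.

```python
import numpy as np


def subsequence_dtw(query, reference):
    """Find the span of `reference` that best matches `query` under DTW.

    query:     (n, d) array of per-frame embeddings (assumed precomputed)
    reference: (m, d) array of per-frame embeddings (assumed precomputed)
    Returns (start, end, cost), with `end` inclusive.
    """
    n, m = len(query), len(reference)
    # Pairwise Euclidean distances between frames, shape (n, m).
    cost = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=-1)

    # acc[i, j]: cheapest alignment of query[:i] that ends at reference[j - 1].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, :] = 0.0  # the match may start anywhere in the reference
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # The match may also end anywhere: pick the cheapest end column.
    end = int(np.argmin(acc[n, 1:])) + 1

    # Backtrack to the first query frame to recover the start column.
    i, j = n, end
    while i > 1:
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
    return j - 1, end - 1, float(acc[n, end])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(40, 8))  # stand-in for one long trajectory
    query = reference[12:20] + 0.01 * rng.normal(size=(8, 8))
    print(subsequence_dtw(query, reference))
```

In a retrieval setting, one could run this for a query segment against every trajectory in the corpus and keep the lowest-cost spans as training data. How segments are chosen and how many matches are kept are design decisions that this sketch leaves open.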