ReMoBot: Retrieval-Based Few-Shot Imitation Learning for Mobile Manipulation with Vision Foundation Models

Imitation learning (IL) algorithms typically distill demonstrations into parametric policies to mimic expert behavior. However, with limited data and partial observability, such as in egocentric mobile manipulation, existing methods often struggle to generate accurate actions. To address these challenges, we propose ReMoBot, a few-shot, trajectory-conditioned imitation learning framework that directly Retrieves information from demonstrations to solve Mobile manipulation tasks with egocentric visual observations. Leveraging vision foundation models, ReMoBot identifies relevant expert demonstrations by combining state-level similarity, history-aware trajectory alignment, and action-sequence consistency to disambiguate perceptually similar observations. The agent then selects appropriate control commands based on these retrieved demonstrations in a fully training-free manner. We evaluate ReMoBot on three mobile manipulation tasks using a Boston Dynamics Spot robot in both simulation and real-world settings. After benchmarking five approaches in simulation, we compare our method with two baselines trained directly on real-world data without sim-to-real transfer. With only 20 demonstrations per task, ReMoBot outperforms the baselines, achieving high success rates in Table Uncover (70%) and Gap Cover (80%), while also showing promising performance on the more challenging Curtain Open task in the real-world setting. Furthermore, ReMoBot generalizes across varying robot positions, object sizes, and material properties, highlighting its robustness in real-world mobile manipulation of deformable objects. Additional details are available at: https://sites.google.com/view/remobot/home
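To make the retrieval idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes observation embeddings have already been computed by some vision foundation model, and it stands in cosine similarity, a fixed history window, a simple weighted sum, and a "closest to the other candidates" rule for the state-level, history-aware, and action-consistency components named above. The function names, the `window`, `k`, and `w_hist` parameters, and the scoring formula are all hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between each row of a (or a single vector) and vector b."""
    a = np.atleast_2d(a)
    return (a @ b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b) + 1e-8)

def retrieve_action(demos, history, window=3, k=3, w_hist=0.5):
    """Pick an action from demonstrations without any training.

    demos:   list of (embeddings [T, D], actions [T, A]) pairs (assumed format).
    history: array [H, D] of recent observation embeddings, newest last.
    """
    history = np.asarray(history, dtype=float)
    n_target = min(window, len(history))
    candidates = []  # (score, action) for every demonstration step
    for emb, act in demos:
        emb, act = np.asarray(emb, dtype=float), np.asarray(act, dtype=float)
        state_sim = cosine(emb, history[-1])  # state-level similarity, shape [T]
        for t in range(len(emb)):
            # History-aware term: compare the steps ending at t with the recent
            # history; steps missing from the demonstration count as zero.
            n = min(n_target, t + 1)
            hist_sim = sum(cosine(emb[t - i], history[-1 - i])[0] for i in range(n)) / n_target
            score = (1 - w_hist) * state_sim[t] + w_hist * hist_sim
            candidates.append((score, act[t]))
    candidates.sort(key=lambda c: c[0], reverse=True)
    top_actions = np.stack([a for _, a in candidates[:k]])
    # Action-consistency term (assumed rule): among the top-k candidates,
    # return the action with the smallest total distance to the others.
    diffs = top_actions[:, None] - top_actions[None]
    dists = np.linalg.norm(diffs, axis=-1).sum(axis=1)
    return top_actions[np.argmin(dists)]
```

In this sketch, two demonstration steps with identical embeddings but different preceding observations receive different scores, which is the kind of disambiguation the history-aware term is meant to provide. A real system would likely need a more careful alignment and action-selection rule than the ones assumed here.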
