Leveraging sensing modalities across diverse spatial and temporal resolutions can improve the performance of robotic manipulation tasks. Multi-spatial-resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal-resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose MResT (Multi-Resolution Transformer), a framework for learning generalizable language-conditioned multi-task policies. MResT uses sensing at different spatial and temporal resolutions, processed by networks of varying capacities, to perform real-time control of precise and reactive tasks effectively. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in three domains (coarse, precise, and dynamic manipulation tasks), we show that our approach improves significantly (2× on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
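The multi-rate idea described in the abstract (a large model on low-frequency global features, a small model on high-frequency local feedback) can be illustrated with a minimal Python sketch. This is not the MResT implementation: the class `MultiRatePolicy`, the toy encoders, the fusion by concatenation, and the refresh period of 5 steps are all hypothetical stand-ins chosen only to show how slow features might be cached and reused between fast control steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Vector = List[float]


@dataclass
class MultiRatePolicy:
    """Hypothetical two-rate policy: slow global features are cached and
    refreshed every `slow_period` steps; local features are recomputed
    at every control step."""
    slow_encoder: Callable[[Sequence[float], str], Vector]  # stand-in for a large pretrained vision-language model
    fast_encoder: Callable[[Sequence[float]], Vector]       # stand-in for a small, non-pretrained local model
    head: Callable[[Vector, Vector], Vector]                # fuses both feature sets into an action
    slow_period: int = 5                                    # assumed value, not taken from the paper
    slow_calls: int = 0
    _step: int = 0
    _cached: Optional[Vector] = None

    def act(self, global_obs: Sequence[float], local_obs: Sequence[float],
            instruction: str) -> Vector:
        # Refresh the expensive global features only on every slow_period-th step.
        if self._step % self.slow_period == 0:
            self._cached = self.slow_encoder(global_obs, instruction)
            self.slow_calls += 1
        self._step += 1
        # The cheap local features are recomputed on every step.
        local_feat = self.fast_encoder(local_obs)
        return self.head(self._cached, local_feat)


def toy_slow_encoder(global_obs: Sequence[float], instruction: str) -> Vector:
    # Toy global feature: mean of the observation and the instruction word count.
    return [sum(global_obs) / len(global_obs), float(len(instruction.split()))]


def toy_fast_encoder(local_obs: Sequence[float]) -> Vector:
    # Toy local feature: the largest reading in the local observation.
    return [max(local_obs)]


def toy_head(global_feat: Vector, local_feat: Vector) -> Vector:
    # Toy fusion: concatenate the two feature vectors and treat them as the action.
    return global_feat + local_feat


policy = MultiRatePolicy(toy_slow_encoder, toy_fast_encoder, toy_head, slow_period=5)
actions = [
    policy.act([0.1 * t, 0.2], [float(t)], "pick up the block")
    for t in range(20)
]
```

In this sketch the slow encoder runs on only 4 of the 20 control steps, while the action still changes at every step through the local features. A real system would replace the toy functions with learned networks and would likely fuse the features with something richer than concatenation.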