In recent years, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and in managing data collection costs. To address these issues, we propose a multimodal robotic manipulation model, RoboMM, together with a comprehensive dataset, RoboData. RoboMM enhances 3D perception through camera parameters and occupancy supervision. Building on OpenFlamingo, it incorporates a Modality-Isolation-Mask and multimodal decoder blocks, improving modality fusion and fine-grained perception. RoboData provides a complete evaluation system by integrating several well-known datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, and actions; its spatial alignment facilitates comprehensive learning from diverse robotic datasets. Equipped with RoboData and a unified physical space, RoboMM is a generalist policy that enables simultaneous evaluation across all tasks within multiple datasets, rather than focusing on a limited selection of data or tasks. Its design significantly improves robotic manipulation performance, raising the average sequence length on the CALVIN benchmark from 1.7 to 3.3, and ensures cross-embodiment capability, achieving state-of-the-art results across multiple datasets.