Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, while RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations and can be transformed between reference frames. Because these point maps are arranged on a regular grid, established computer vision techniques can be applied directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. An overview and demos are available on our project page: https://point-map.github.io/Point-Map/
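To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of a point map as a structured grid: depth pixels are back-projected into 3D while keeping the camera's pixel layout, so a standard 2D convolution can consume the XYZ channels just like an RGB image. The function name `depth_to_point_map` and the intrinsics `K` are illustrative assumptions, as is the pinhole camera model.

```python
import torch
import torch.nn as nn

def depth_to_point_map(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Back-project a depth image (H, W) into a point map (3, H, W).

    Unlike an unordered point cloud, the output keeps the camera's
    pixel grid: each pixel stores the 3D point it observes.
    """
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel coordinate grids, aligned with the depth image layout.
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    # Pinhole back-projection of every pixel.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return torch.stack([x, y, depth], dim=0)  # (3, H, W)

# Because the point map is a regular grid, established 2D vision ops
# apply directly; a Conv2d treats XYZ channels like RGB channels.
K = torch.tensor([[300.0, 0.0, 160.0],
                  [0.0, 300.0, 120.0],
                  [0.0, 0.0, 1.0]])
point_map = depth_to_point_map(torch.rand(240, 320), K)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
features = conv(point_map.unsqueeze(0))  # (1, 16, 240, 320)
```

In this sketch the grid structure is what removes the need for downsampling or permutation-invariant point operators; how the resulting features are fused with RGB tokens in the xLSTM backbone is specific to the paper and not reproduced here.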