Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, while RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations, and can be transformed between reference frames. Moreover, because the points are arranged in a regular grid, established computer vision techniques can be applied directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. The overview and demos are available on our project page: https://point-map.github.io/Point-Map/
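To make the "point map" idea concrete, here is a minimal sketch (not the authors' implementation) of how a depth image can be unprojected into a structured H×W grid of 3D points under a standard pinhole camera model, and how such a grid can be rigidly transformed between reference frames while keeping its image-aligned structure. All function names and intrinsics here are illustrative assumptions.

```python
import numpy as np

def depth_to_point_map(depth: np.ndarray, fx: float, fy: float,
                       cx: float, cy: float) -> np.ndarray:
    """Unproject an (H, W) depth map into an (H, W, 3) point map.

    Each pixel keeps its grid position, so 2D vision backbones can
    operate on the result just like on an image.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def transform_point_map(pmap: np.ndarray, rot: np.ndarray,
                        trans: np.ndarray) -> np.ndarray:
    """Rigidly transform a point map into another reference frame.

    The (H, W) grid structure is preserved; only the 3D coordinates
    stored at each grid cell change.
    """
    return pmap @ rot.T + trans

# Example: a 4x4 depth image at 1 m with made-up intrinsics.
depth = np.ones((4, 4))
pmap = depth_to_point_map(depth, fx=50.0, fy=50.0, cx=2.0, cy=2.0)
print(pmap.shape)  # (4, 4, 3) -- same grid layout as the image
```

Because the output stays a regular grid aligned with the RGB image, no downsampling or unordered-set processing is needed before fusing it with image features.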