We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite the distinct sensing geometries, densities, and priors of these domains, Utonia learns a consistent representation space that transfers across them. This unification improves perception performance while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data and support downstream applications in AR/VR, robotics, and autonomous driving.