We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite the distinct sensing geometries, densities, and priors of these domains, Utonia learns a consistent representation space that transfers across them. This unification improves perception performance while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data and support downstream applications in AR/VR, robotics, and autonomous driving.