Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition that combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. In linear probing for 3D scene perception, it outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto yields emergent spatial representations with superior fine-grained geometric and semantic consistency.
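To make the combined objective concrete, the following is a minimal sketch (not the authors' implementation) of how a 3D intra-modal self-distillation term could be paired with a 2D-3D cross-modal alignment term in a single joint loss. All names here (self_distillation_loss, cross_modal_loss, the temperatures, and the weighting lam) are hypothetical placeholders for illustration only.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a joint training objective in the spirit of Concerto:
# an intra-modal self-distillation term on 3D point features (DINO-style
# teacher-student cross-entropy) plus a 2D-3D cross-modal alignment term on
# paired point/pixel features. Shapes: (num_points, feature_dim).

def self_distillation_loss(student_feat, teacher_feat, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between sharpened teacher and student feature distributions."""
    teacher_prob = F.softmax(teacher_feat.detach() / tau_t, dim=-1)
    student_logprob = F.log_softmax(student_feat / tau_s, dim=-1)
    return -(teacher_prob * student_logprob).sum(dim=-1).mean()

def cross_modal_loss(feat_3d, feat_2d):
    """Cosine-similarity alignment of 3D point features to their paired 2D pixel features."""
    feat_3d = F.normalize(feat_3d, dim=-1)
    feat_2d = F.normalize(feat_2d.detach(), dim=-1)
    return (1.0 - (feat_3d * feat_2d).sum(dim=-1)).mean()

def joint_loss(student_feat, teacher_feat, feat_3d_proj, feat_2d_proj, lam=1.0):
    """Weighted sum of the intra-modal and cross-modal terms (lam is a hypothetical weight)."""
    return (self_distillation_loss(student_feat, teacher_feat)
            + lam * cross_modal_loss(feat_3d_proj, feat_2d_proj))

# Toy usage with random tensors standing in for per-point features.
if __name__ == "__main__":
    n, d = 1024, 256
    loss = joint_loss(torch.randn(n, d), torch.randn(n, d),
                      torch.randn(n, d), torch.randn(n, d))
    print(loss.item())
```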