Depth sensors are widely deployed across robotic platforms, and advances in fast, high-fidelity depth simulation have enabled robotic policies trained on depth observations to achieve robust sim-to-real transfer across a wide range of tasks. Despite this, representation learning for the depth modality remains underexplored compared to RGB, where large-scale foundation models now define the state of the art. To address this gap, we present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. Using a DINO-style self-distillation objective on a curated dataset of 60M depth images, DeFM learns geometric and semantic representations that generalize across diverse environments, tasks, and sensors. To retain metric awareness across multiple scales, we introduce a novel input normalization strategy. We further distill DeFM into compact models suitable for resource-constrained robotic systems. Evaluated on depth-based classification, segmentation, navigation, locomotion, and manipulation benchmarks, DeFM achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments. We release all our pretrained models, which can be adopted off-the-shelf for depth-based robotic learning without task-specific fine-tuning. Webpage: https://de-fm.github.io/