We present SS3D, a web-scale pretraining pipeline based on structure-from-motion (SfM) self-supervision for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and camera intrinsics in a single forward pass and is trained and evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging because of weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and better fine-tuning performance than prior self-supervised baselines. We release the pretrained checkpoint and code.