Existing video depth estimation methods face a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. DVD has three core designs: (i) repurposing the diffusion timestep as a structural anchor that balances global stability against high-frequency detail; (ii) latent manifold rectification (LMR), which mitigates regression-induced over-smoothing by enforcing differential constraints that restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property that bounds inter-window divergence and thereby enables seamless long-video inference without complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD unlocks the geometric priors implicit in video foundation models using 163× less task-specific data than leading baselines. We release the full pipeline, including the complete training suite for state-of-the-art video depth estimation, to benefit the open-source community.
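To make the idea of affine coherence between inference windows concrete, the sketch below shows what a simple scale-and-shift alignment between overlapping windows can look like. It is an illustration only and not the DVD implementation: the function names `fit_affine` and `stitch_windows`, the `(T, H, W)` window layout, and the fixed `overlap` parameter are all assumptions made for this example. It assumes that two windows' depth predictions differ, at most, by a single global affine map, in which case a closed-form least-squares fit on the shared frames recovers that map.

```python
import numpy as np

def fit_affine(src, ref):
    """Least-squares scale s and shift t such that s * src + t ~= ref."""
    x = np.asarray(src, dtype=np.float64).ravel()
    y = np.asarray(ref, dtype=np.float64).ravel()
    # Design matrix [x, 1] so the solution vector is (s, t).
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def stitch_windows(windows, overlap):
    """Chain depth windows of shape (T, H, W) that share `overlap` frames.

    Hypothetical helper: each window is mapped into the frame of the
    previous (already aligned) window using one global scale and shift.
    """
    prev = np.asarray(windows[0], dtype=np.float64)
    pieces = [prev]
    for win in windows[1:]:
        win = np.asarray(win, dtype=np.float64)
        # Fit on the frames both windows cover, then apply to the whole window.
        s, t = fit_affine(win[:overlap], prev[-overlap:])
        prev = s * win + t
        pieces.append(prev[overlap:])  # drop the frames already emitted
    return np.concatenate(pieces, axis=0)
```

If the inter-window divergence is indeed bounded by a global affine relationship, as the abstract states, a two-parameter fit of this kind (or no correction at all) would suffice, in contrast to per-pixel or learned temporal alignment.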