Current discriminative depth estimation methods often produce blurry artifacts, while generative approaches suffer from slow sampling due to the curvature of the noise-to-depth transport. We address these challenges by framing depth estimation as a direct transport between the image and depth distributions. To our knowledge, we are the first to explore flow matching in this field, and we demonstrate that its straight interpolation trajectories improve both training and sampling efficiency while preserving high performance. Although generative models typically require extensive training data, we reduce this dependency by integrating external knowledge from a pre-trained image diffusion model, enabling effective transfer even across differing objectives. To further improve performance, we employ synthetic data and image-depth pairs generated by a discriminative model on an in-the-wild image dataset. As a generative model, ours offers the additional advantage of reliably estimating depth confidence. Our approach achieves competitive zero-shot performance on standard benchmarks of complex natural scenes while improving sampling efficiency and requiring only minimal synthetic data for training.
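The efficiency argument above rests on a core property of flow matching: along a straight-line interpolant between a source and a target sample, the regression target is a constant velocity field, so integrating the learned flow needs far fewer steps than a curved diffusion trajectory. The sketch below illustrates this general construction with toy NumPy arrays; the array names and the pairing of "image-conditioned source" with "depth target" are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x0, x1, t):
    """Straight-line interpolant between source x0 and target x1 at time t,
    together with the constant velocity field that flow matching regresses onto."""
    xt = (1.0 - t) * x0 + t * x1   # linear interpolation trajectory
    v_target = x1 - x0             # velocity is constant along a straight line
    return xt, v_target

# Toy stand-ins: in the paper's setting x0 would be an image-conditioned source
# sample and x1 a depth map (illustrative assumption).
x0 = rng.standard_normal((4, 4))
x1 = rng.standard_normal((4, 4))

t = 0.5
xt, v = flow_matching_pair(x0, x1, t)

# Because the trajectory is straight, a single Euler step of size (1 - t)
# recovers the target exactly -- the reason few sampling steps suffice.
x1_reconstructed = xt + (1.0 - t) * v
assert np.allclose(x1_reconstructed, x1)
```

In practice a network is trained to predict `v_target` from `(xt, t)` plus image conditioning; at inference, the ODE is integrated from the source toward the depth distribution, and the near-straight trajectories keep the step count low.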