Monocular depth estimation remains challenging: even recent foundation models such as Depth Anything V2 (DA-V2) struggle with real-world images that lie far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting the predicted depth map and augmenting the input. This re-synthesis replaces classical photometric reconstruction, leveraging shape-from-shading (SfS) cues in a new generative context via Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and update only the intermediate embeddings while also fine-tuning the decoder. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over DA-V2, pointing to new avenues for self-supervision that augment geometric reasoning.
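To make the targeted optimization strategy concrete, the sketch below outlines one possible PyTorch-style implementation of the test-time loop. It is not the authors' code: the `encoder`/`decoder` split of DA-V2, the differentiable shading renderer `shade_from_depth`, and the `sds_grad` helper wrapping a frozen 2D diffusion prior are all hypothetical placeholders standing in for the components named in the abstract.

```python
# Minimal sketch (assumed interfaces, not the released implementation) of
# test-time refinement: freeze the encoder, optimize its intermediate
# embeddings, and fine-tune the decoder under an SDS loss on re-lit depth.
import torch

def refine_depth(image, encoder, decoder, shade_from_depth, sds_grad,
                 steps=200, lr=1e-4):
    """Label-free test-time refinement of the depth prediction for one image."""
    # 1. Frozen encoder: run it once, then treat the cached intermediate
    #    embeddings as free variables (the encoder weights never change).
    with torch.no_grad():
        embeddings = [e.detach().clone() for e in encoder(image)]
    for e in embeddings:
        e.requires_grad_(True)

    # 2. Only the embeddings and the decoder are trainable; everything else
    #    stays frozen to prevent optimization collapse.
    decoder.requires_grad_(True)
    optimizer = torch.optim.Adam(embeddings + list(decoder.parameters()), lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        depth = decoder(embeddings)             # current depth prediction
        relit = shade_from_depth(depth, image)  # shape-from-shading re-synthesis
        # 3. SDS: the frozen diffusion prior scores the re-lit image (conditioned
        #    on the input, which may be augmented in practice); its gradient is
        #    backpropagated through the shading renderer into depth.
        grad = sds_grad(relit, condition=image)
        relit.backward(gradient=grad)
        optimizer.step()

    with torch.no_grad():
        return decoder(embeddings)
```

In this sketch the only trainable state is the cloned embeddings plus the decoder weights, mirroring the paper's choice of avoiding both direct depth optimization and full-model fine-tuning.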