From Editor to Dense Geometry Estimator

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has proven successful in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining'' their innate features and ultimately to achieve higher performance than their generative counterparts. Based on these findings, we introduce \textbf{FE2E}, a framework that pioneers the adaptation of an advanced editing model based on the Diffusion Transformer (DiT) architecture to dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into a ``consistent velocity'' training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high-precision demands of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves significant performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ as much data. The project page is available \href{https://amap-ml.github.io/FE2E/}{here}.
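
To make the BFloat16 precision conflict concrete, the following is a minimal sketch, not the FE2E implementation. It assumes depth is mapped into $[-1, 1]$ before being stored in BFloat16 and compares a uniform (linear) mapping with a logarithmic one. The range `D_MIN`, `D_MAX`, the helper names, and the exact form of both mappings are illustrative assumptions; the abstract does not specify them.

```python
import math
import struct

# Assumed working depth range in metres (hypothetical; not given in the abstract).
D_MIN, D_MAX = 0.1, 80.0


def to_bf16(x: float) -> float:
    """Round x to the nearest BFloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def encode_linear(d: float) -> float:
    """Map depth uniformly from [D_MIN, D_MAX] to [-1, 1]."""
    return 2.0 * (d - D_MIN) / (D_MAX - D_MIN) - 1.0


def decode_linear(v: float) -> float:
    return (v + 1.0) / 2.0 * (D_MAX - D_MIN) + D_MIN


def encode_log(d: float) -> float:
    """Map log-depth uniformly from [log D_MIN, log D_MAX] to [-1, 1]."""
    return 2.0 * math.log(d / D_MIN) / math.log(D_MAX / D_MIN) - 1.0


def decode_log(v: float) -> float:
    return D_MIN * math.exp((v + 1.0) / 2.0 * math.log(D_MAX / D_MIN))


def roundtrip_rel_error(d: float, encode, decode) -> float:
    """Relative depth error after storing the encoded value in BFloat16."""
    return abs(decode(to_bf16(encode(d))) - d) / d


# Near-range depths, where the uniform mapping is expected to suffer most.
near_depths = [0.15, 0.3, 0.5, 0.7, 0.9]
linear_err = max(roundtrip_rel_error(d, encode_linear, decode_linear) for d in near_depths)
log_err = max(roundtrip_rel_error(d, encode_log, decode_log) for d in near_depths)

print(f"max relative error, linear encoding: {linear_err:.4f}")
print(f"max relative error, log encoding:    {log_err:.4f}")
```

Under these assumptions, the uniform mapping squeezes all near-range depths into values close to $-1$, where adjacent BFloat16 values are $2^{-8}$ apart, so a fixed rounding step becomes a large relative depth error. The logarithmic mapping turns the same rounding step into a roughly constant relative error (below 1\% for the assumed range), which is one plausible reason a logarithmic scheme suits this setting.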
