Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built on pixel-space diffusion transformers (DiT). To address the high computational complexity of pixel-space diffusion, we propose two key designs: 1) a Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) a Cascade DiT architecture, which progressively increases the number of image tokens, improving both efficiency and accuracy. To extend PPD to video, we further present Pixel-Perfect Video Depth (PPVD), which introduces a new Semantics-Consistent DiT that extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among generative monocular and video depth estimation models and produce significantly cleaner point clouds than existing alternatives.
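The two designs above lend themselves to a compact illustration. Below is a minimal PyTorch sketch of how a pixel-space DiT block might be prompted with semantic tokens from a frozen vision foundation model, and how a two-stage cascade might progressively grow the token count; the class names, the use of cross-attention for the semantic prompt, and the 2x-per-axis token upsampling are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of (1) a semantics-prompted DiT block and
# (2) a two-stage cascade over image tokens. Illustrative only: names,
# cross-attention prompting, and the upsampling scheme are assumptions.
import torch
import torch.nn as nn


class SemanticsPromptedBlock(nn.Module):
    """One DiT block whose self-attention is followed by cross-attention
    to semantic tokens (e.g., features from a vision foundation model)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, D) noisy pixel-space depth tokens
        # sem: (B, M, D) semantic tokens that "prompt" the denoising
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, sem, sem, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


class CascadeDiT(nn.Module):
    """Two-stage cascade: a coarse stage over few tokens, after which the
    token grid is upsampled 2x per axis and refined by a second stage."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.coarse = nn.ModuleList(
            [SemanticsPromptedBlock(dim) for _ in range(depth)]
        )
        self.fine = nn.ModuleList(
            [SemanticsPromptedBlock(dim) for _ in range(depth)]
        )
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) with N = H*W tokens on a square grid
        for blk in self.coarse:
            x = blk(x, sem)
        b, n, d = x.shape
        side = int(n ** 0.5)
        grid = x.transpose(1, 2).reshape(b, d, side, side)
        x = self.upsample(grid).flatten(2).transpose(1, 2)  # 4x more tokens
        for blk in self.fine:
            x = blk(x, sem)
        return x


if __name__ == "__main__":
    model = CascadeDiT()
    tokens = torch.randn(1, 16 * 16, 256)   # coarse noisy-depth tokens
    semantics = torch.randn(1, 256, 256)    # frozen foundation-model features
    print(model(tokens, semantics).shape)   # torch.Size([1, 1024, 256])
```

Running most denoising steps at the coarse resolution and only the later refinement at the full token count is what keeps the pixel-space cascade tractable in this sketch.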