Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built on pixel-space diffusion transformers (DiT). To address the high computational complexity of pixel-space diffusion, we propose two key designs: 1) a Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) a Cascade DiT architecture, which progressively increases the number of image tokens, improving both efficiency and accuracy. To extend PPD to video, we further present Pixel-Perfect Video Depth (PPVD), which introduces a new Semantics-Consistent DiT that extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among generative monocular and video depth estimation models and produce significantly cleaner point clouds than existing alternatives.
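The two designs above lend themselves to a compact illustration. Below is a minimal PyTorch sketch of how a pixel-space DiT block might be prompted with semantic tokens from a frozen vision foundation model, and how a two-stage cascade might progressively grow the token count; the class names, the use of cross-attention for the semantic prompt, and the 2x-per-axis token upsampling are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of (1) a semantics-prompted DiT block and
# (2) a two-stage cascade over image tokens. Illustrative only: names,
# cross-attention prompting, and the upsampling scheme are assumptions.
import torch
import torch.nn as nn


class SemanticsPromptedBlock(nn.Module):
    """One DiT block whose self-attention is followed by cross-attention
    to semantic tokens (e.g., features from a vision foundation model)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, D) noisy pixel-space depth tokens
        # sem: (B, M, D) semantic tokens that "prompt" the denoising
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, sem, sem, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


class CascadeDiT(nn.Module):
    """Two-stage cascade: a coarse stage over few tokens, after which the
    token grid is upsampled 2x per axis and refined by a second stage."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.coarse = nn.ModuleList(
            [SemanticsPromptedBlock(dim) for _ in range(depth)]
        )
        self.fine = nn.ModuleList(
            [SemanticsPromptedBlock(dim) for _ in range(depth)]
        )
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) with N = H*W tokens on a square grid
        for blk in self.coarse:
            x = blk(x, sem)
        b, n, d = x.shape
        side = int(n ** 0.5)
        grid = x.transpose(1, 2).reshape(b, d, side, side)
        x = self.upsample(grid).flatten(2).transpose(1, 2)  # 4x more tokens
        for blk in self.fine:
            x = blk(x, sem)
        return x


if __name__ == "__main__":
    model = CascadeDiT()
    tokens = torch.randn(1, 16 * 16, 256)   # coarse noisy-depth tokens
    semantics = torch.randn(1, 256, 256)    # frozen foundation-model features
    print(model(tokens, semantics).shape)   # torch.Size([1, 1024, 256])
```

Running most denoising steps at the coarse resolution and only the later refinement at the full token count is what keeps the pixel-space cascade tractable in this sketch.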