Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet it remains challenging because of wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining; they require fewer labelled datasets than transformer-based predictors and avoid the rigid capture-geometry requirements of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. We validate the method on ground-marker flights captured with the DLR Modular Aerial Camera System (MACS) at approximately 50 m altitude (GSD of approximately 0.85 cm/px, corresponding to about 2,650 square meters of ground coverage per frame). The method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. These results are subject to minor noise from manual point-cloud annotation. The findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.
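The idea of reprojecting BA tie-points into a frame and using them as metric anchors can be illustrated with a minimal sketch. The abstract does not specify how the guidance enters the diffusion model, so the code below swaps in a much simpler technique: a post-hoc least-squares scale-and-shift fit between a relative depth map and the tie-point depths. The pinhole model, the world-to-camera pose convention (`R`, `t`), and all function names (`project_points`, `fit_scale_shift`, `align_depth`) are assumptions made for illustration, not part of ZeD-MAP.

```python
import numpy as np


def project_points(points_world, R, t, K):
    """Project Nx3 world points into a pinhole camera.

    Assumes a world-to-camera pose (X_cam = R @ X_world + t) and an
    intrinsic matrix K whose last row is [0, 0, 1].
    Returns (M, 2) pixel coordinates and (M,) depths for the points
    that lie in front of the camera.
    """
    pts_cam = points_world @ R.T + t
    pts_cam = pts_cam[pts_cam[:, 2] > 0]  # drop points behind the camera
    uv = (pts_cam @ K.T)[:, :2] / pts_cam[:, 2:3]
    return uv, pts_cam[:, 2]


def fit_scale_shift(rel, metric):
    """Least-squares fit of metric ~= s * rel + b."""
    A = np.stack([rel, np.ones_like(rel)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return s, b


def align_depth(rel_depth, points_world, R, t, K):
    """Rescale a relative depth map using sparse metric tie-points.

    Hypothetical helper: samples rel_depth at the nearest pixel of each
    reprojected tie-point, fits one global scale and shift, and applies
    them to the whole map.
    """
    h, w = rel_depth.shape
    uv, z = project_points(points_world, R, t, K)
    cols = np.rint(uv[:, 0]).astype(int)
    rows = np.rint(uv[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    if inside.sum() < 2:
        raise ValueError("need at least two tie-points inside the frame")
    s, b = fit_scale_shift(rel_depth[rows[inside], cols[inside]], z[inside])
    return s * rel_depth + b, s, b
```

A single global scale and shift is the coarsest possible form of metric guidance; it cannot correct spatially varying errors, which is presumably one reason a pipeline like the one described would inject the tie-points into the depth estimation itself rather than only rescaling afterwards.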