We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, accurately extracting environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of the scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale recovers the metric scale of the scene. In robotic settings where camera intrinsics are known, they can be readily incorporated to improve performance, with additional gains when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch; it leverages the world priors of pre-existing models that lack geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.