The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, building them from cost-effective sparse imagery remains challenging because of limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving decimeter-level 3D localization. Third, going beyond static geometric mapping, a Vision-Language Model agent that uses multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and reduces localization error. The framework therefore offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, bridging the gap between sparse perception and automated intelligent maintenance.