Autonomous 3D scanning of open-world target structures via drones remains challenging despite its broad applications. Existing paradigms rely on restrictive assumptions or effortful human priors, limiting their practicality, efficiency, and adaptability. Recent foundation models (FMs) offer great potential to bridge this gap. This paper investigates a critical research problem: what system architecture can effectively integrate FM knowledge for this task? We answer it with FlyCo, a principled FM-empowered perception-prediction-planning loop that enables fully autonomous, prompt-driven 3D target scanning in diverse unknown open-world environments. FlyCo directly translates low-effort human prompts (text, visual annotations) into precise, adaptive scanning flights via three coordinated stages: (1) perception fuses streaming sensor data with vision-language FMs for robust target grounding and tracking; (2) prediction distills FM knowledge and combines multi-modal cues to infer the partially observed target's complete geometry; (3) planning leverages this predictive foresight to generate efficient, safe paths with comprehensive target coverage. Building on this architecture, we further design key components that boost the efficiency and robustness of open-world target grounding, enhance prediction quality in terms of shape accuracy, zero-shot generalization, and temporal stability, and balance long-horizon flight efficiency with real-time computability and online collision avoidance. Extensive experiments in challenging real-world and simulated environments show that FlyCo delivers precise scene understanding, high efficiency, and real-time safety, outperforming existing paradigms with lower human effort and verifying the proposed architecture's practicality. Comprehensive ablations validate each component's contribution. FlyCo also serves as a flexible, extensible blueprint that can readily leverage future FM and robotics advances. Code will be released.