Cooperative visual semantic navigation is a foundational capability for aerial robot teams operating in unknown environments. However, achieving robust open-vocabulary object-goal navigation remains challenging due to the computational constraints of deploying heavy perception models onboard and the complexity of decentralized multi-agent coordination. We present GoalSwarm, a fully decentralized multi-UAV framework for zero-shot semantic object-goal navigation. The UAVs collaboratively construct a shared, lightweight 2D top-down semantic occupancy map by projecting depth observations from their aerial vantage points, eliminating the computational burden of full 3D representations while preserving essential geometric and semantic structure. The core contributions of GoalSwarm are threefold: (1) integration of a zero-shot foundation model, SAM3, for open-vocabulary detection and pixel-level segmentation, enabling target identification without task-specific training; (2) a Bayesian Value Map that fuses multi-viewpoint detection confidences into a per-pixel goal-relevance distribution, enabling informed frontier scoring via Upper Confidence Bound (UCB) exploration; and (3) a decentralized coordination strategy combining semantic frontier extraction, cost-utility bidding with geodesic path costs, and spatial separation penalties to minimize redundant exploration across the swarm.
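A minimal sketch of how a per-pixel Bayesian Value Map with UCB frontier scoring might look, assuming each grid cell keeps a Beta posterior over goal relevance that is updated with per-view detection confidences; the class and parameter names (`BayesianValueMap`, `prior_alpha`, `kappa`) are illustrative assumptions, not GoalSwarm's actual implementation.

```python
import numpy as np

class BayesianValueMap:
    """Per-cell Beta posterior over goal relevance on a 2D top-down grid (sketch)."""

    def __init__(self, height, width, prior_alpha=1.0, prior_beta=1.0):
        self.alpha = np.full((height, width), prior_alpha)
        self.beta = np.full((height, width), prior_beta)

    def update(self, cells, confidences):
        """Fuse detection confidences in [0, 1] observed at the given grid cells."""
        for (r, c), p in zip(cells, confidences):
            self.alpha[r, c] += p          # evidence that the goal is here
            self.beta[r, c] += 1.0 - p     # evidence that it is not

    def ucb(self, r, c, kappa=1.0):
        """Posterior mean relevance plus an uncertainty bonus for frontier scoring."""
        a, b = self.alpha[r, c], self.beta[r, c]
        mean = a / (a + b)
        var = (a * b) / ((a + b) ** 2 * (a + b + 1.0))
        return mean + kappa * np.sqrt(var)
```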
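The cost-utility bidding can likewise be sketched under simple assumptions: utility is the UCB relevance score of a frontier, cost is a geodesic (obstacle-aware) path length on the shared 2D occupancy grid, and a spatial separation penalty discourages bidding near frontiers already claimed by teammates. The helper names and weighting parameters (`lambda_cost`, `mu_sep`, `sep_radius`) are assumptions for illustration only.

```python
import heapq
import numpy as np

def geodesic_cost(occupancy, start, goal):
    """Shortest obstacle-aware path length on a 2D grid (4-connected Dijkstra)."""
    h, w = occupancy.shape
    dist = np.full((h, w), np.inf)
    dist[start] = 0.0
    pq = [(0.0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            return d
        if d > dist[r, c]:
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and occupancy[nr, nc] == 0:
                nd = d + 1.0
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    heapq.heappush(pq, (nd, (nr, nc)))
    return np.inf  # goal unreachable from start

def frontier_bid(ucb_score, path_cost, teammate_targets, frontier,
                 lambda_cost=0.05, mu_sep=0.5, sep_radius=10.0):
    """Cost-utility bid: UCB utility minus path cost and a crowding penalty."""
    penalty = sum(
        max(0.0, 1.0 - np.hypot(frontier[0] - t[0], frontier[1] - t[1]) / sep_radius)
        for t in teammate_targets
    )
    return ucb_score - lambda_cost * path_cost - mu_sep * penalty
```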