Autonomous robot operation in unstructured environments is often underpinned by spatial understanding through vision. Systems composed of multiple concurrently operating robots additionally require frequent, accurate, and reliable pose estimates. Classical vision-based methods for regressing relative pose are commonly computationally expensive, precluding real-time applications, and often lack data-derived priors for resolving ambiguities. In this work, we propose CoViS-Net, a cooperative, multi-robot visual spatial foundation model that learns spatial priors from data, enabling pose estimation as well as general spatial comprehension. Our model is fully decentralized, platform-agnostic, executable in real time on onboard compute, and does not require existing networking infrastructure. CoViS-Net provides relative pose estimates and a local bird's-eye-view (BEV) representation, even without camera overlap between robots, and can predict BEV representations of unseen regions. We demonstrate its use in a multi-robot formation control task across various real-world settings. We provide supplementary material online and will open-source our trained model in due course: https://sites.google.com/view/covis-net