Recently, visual grounding and multi-sensor setups have been incorporated into the perception systems of terrestrial autonomous driving platforms and Unmanned Surface Vehicles (USVs). However, the high complexity of modern learning-based, multi-sensor visual grounding models prevents their deployment on USVs in real-world settings. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, which guides both the camera and the 4D millimeter-wave radar to locate specific objects through natural language. NanoMVG can perform box-level and mask-level visual grounding tasks simultaneously. Compared with other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments, while boasting ultra-low power consumption for long endurance.