Underwater environments present unique challenges for robotic operation, including complex hydrodynamics, limited visibility, and constrained communication. Although data-driven approaches have advanced embodied intelligence in terrestrial robots and enabled task-specific autonomous underwater robots, developing underwater intelligence capable of autonomously performing multiple tasks remains highly challenging, as large-scale, high-quality underwater datasets are still scarce. To address these limitations, we introduce USIM, a simulation-based multi-task Vision-Language-Action (VLA) dataset for underwater robots. USIM comprises over 561K frames from 1,852 trajectories, totaling approximately 15.6 hours of BlueROV2 interactions across 20 tasks in 9 diverse scenarios, ranging from visual navigation to mobile manipulation. Building on this dataset, we propose U0, a VLA model for general underwater robots that integrates binocular vision and other sensor modalities through multimodal fusion and incorporates a convolution-attention-based perception focus enhancement module (CAP) to improve spatial understanding and mobile manipulation. Across tasks such as inspection, obstacle avoidance, scanning, and dynamic tracking, the framework achieves an 80% success rate, and in challenging mobile manipulation tasks it reduces the distance to the target by 21.2% compared with baseline methods. USIM and U0 demonstrate that VLA models can be effectively applied to underwater robotics, providing a foundation for scalable dataset construction, improved task autonomy, and the practical realization of intelligent general underwater robots.
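To make the idea behind the convolution-attention-based perception focus enhancement module (CAP) concrete, the following is a minimal sketch, not the authors' implementation: convolutional features from a camera frame are re-weighted by channel and spatial attention before being fused with other modalities. All module names, channel sizes, and input shapes are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the authors' CAP implementation):
# convolutional features re-weighted by channel and spatial attention
# to focus perception on task-relevant regions.
import torch
import torch.nn as nn


class ConvAttentionFocus(nn.Module):
    """Convolutional feature extractor followed by channel/spatial attention."""

    def __init__(self, in_channels: int = 3, feat_channels: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_channels),
            nn.ReLU(inplace=True),
        )
        # Channel attention: squeeze spatial dims, weight each channel.
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(feat_channels, feat_channels // 8, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels // 8, feat_channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: single-channel map highlighting salient regions.
        self.spatial_attn = nn.Sequential(
            nn.Conv2d(feat_channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x)
        feat = feat * self.channel_attn(feat)   # emphasize informative channels
        feat = feat * self.spatial_attn(feat)   # focus on salient spatial regions
        return feat


if __name__ == "__main__":
    # Example: a batch of stereo camera frames processed independently
    # before fusion with other sensor modalities (fusion step omitted here).
    left = torch.randn(2, 3, 224, 224)
    block = ConvAttentionFocus()
    print(block(left).shape)  # torch.Size([2, 64, 224, 224])
```

In such a design, the attention-weighted feature maps would be concatenated or cross-attended with embeddings from the other sensor streams before being passed to the VLA policy head; the exact fusion mechanism used in U0 is described in the paper itself.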