Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large-scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB-only, limiting robustness under severe color distortion, turbidity, and low-visibility conditions. We introduce MUOT_3M, the first pseudo-multimodal UOT benchmark, comprising 3 million frames from 3,030 videos (27.8 h) annotated with 32 tracking attributes, 677 fine-grained classes, and synchronized RGB, estimated enhanced-RGB, estimated depth, and language modalities, validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal-to-unimodal tracker featuring visual-geometric alignment, vision-language fusion, and four-level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest state-of-the-art baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.
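The four-level distillation scheme is specific to MUTrack, but the underlying teacher-to-student transfer can be illustrated with a standard temperature-scaled KL-divergence distillation loss. This is a minimal NumPy sketch of the generic technique, not the paper's exact objective; the function and variable names are hypothetical:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; a higher T softens the distribution.
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened outputs,
    # scaled by T^2 as in standard knowledge distillation.
    p = softmax(teacher_logits, T)  # soft targets from the multimodal teacher
    q = softmax(student_logits, T)  # predictions of the unimodal student
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)
```

The loss is zero when student and teacher agree exactly and grows as their softened output distributions diverge, which is what drives the multimodal knowledge into the RGB-only student during training.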