Effective collaboration among multiple humans and robots is essential for expanding human-led operations in challenging, high-risk underwater environments. For autonomous underwater vehicles (AUVs) to act as true teammates, they must comprehend their surroundings and recognize a diver's activities so that they can offer assistance and ensure safety. Toward this goal, we introduce DAR-Net, a transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution is a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address data scarcity in this domain, we present the first Underwater Diver Activity (UDA) dataset, containing over 2,600 annotated images with pixel-level masks. In experimental evaluations conducted in a controlled environment, DAR-Net recognizes six distinct diver activities with promising accuracy and outperforms state-of-the-art models. The dataset provides a baseline, and this work is a first step toward more intelligent, collaborative underwater robotic systems.
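As a rough illustration of the multi-loss idea described above, the following minimal sketch combines an activity-classification loss with a per-pixel segmentation loss. It is not the DAR-Net implementation: the abstract does not specify the loss terms or their weighting, so the use of cross-entropy for both terms, the function names, and the `seg_weight` parameter are all assumptions made here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(softmax(logits)[target])

def multi_loss(activity_logits, activity_label,
               pixel_logits, pixel_labels, seg_weight=0.5):
    """Hypothetical combined objective: global activity loss plus a
    weighted mean per-pixel loss. seg_weight is an assumed hyperparameter,
    not a value taken from the paper."""
    cls_loss = cross_entropy(activity_logits, activity_label)
    seg_loss = sum(
        cross_entropy(logits, label)
        for logits, label in zip(pixel_logits, pixel_labels)
    ) / len(pixel_labels)
    return cls_loss + seg_weight * seg_loss

# Toy example: six activity classes (as in the abstract) and four pixels
# with three assumed segmentation classes, all with uniform scores.
activity_logits = [0.0] * 6
pixel_logits = [[0.0, 0.0, 0.0] for _ in range(4)]
loss = multi_loss(activity_logits, 2, pixel_logits, [0, 1, 2, 0])
```

With uniform scores, the classification term equals ln 6 and the segmentation term equals ln 3, so the combined value is ln 6 + 0.5 · ln 3. In a real training setup the two terms would come from separate network heads sharing a backbone, and the weighting would be tuned on validation data.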