Hand gesture recognition allows humans to interact with machines non-verbally, which has important applications in underwater exploration using autonomous underwater vehicles. Recently, a new gesture-based language called CADDIAN has been devised for divers, and supervised learning methods have been applied to recognize its gestures with high accuracy. However, such methods fail when they encounter unseen gestures in real time. In this work, we advocate the need for zero-shot underwater gesture recognition (ZSUGR), where the objective is to train a model with visual samples of gestures from only a few ``seen'' classes and transfer the gained knowledge at test time to recognize semantically similar unseen gesture classes as well. After discussing the problem and dataset-specific challenges, we propose new seen-unseen splits for the gesture classes in the CADDY dataset. We then present a two-stage framework in which a novel transformer learns strong visual gesture cues and feeds them to a conditional generative adversarial network that learns to mimic the feature distribution. We use the trained generator as a feature synthesizer for unseen classes, enabling zero-shot learning. Extensive experiments demonstrate that our method outperforms existing zero-shot techniques. We conclude by providing useful insights into our framework and suggesting directions for future research.
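The feature-synthesis idea in the second stage can be illustrated with a minimal sketch: a conditional generator maps a class's semantic embedding plus noise to a synthetic visual feature, so that features for unseen classes can be generated and used to train a classifier. All names and dimensions below (`embed_dim`, `noise_dim`, `feat_dim`, the MLP layout) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Sketch of a conditional generator for zero-shot feature synthesis.

    Conditions on a semantic class embedding concatenated with noise and
    outputs a synthetic visual feature vector. Dimensions are hypothetical.
    """

    def __init__(self, embed_dim=300, noise_dim=64, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + noise_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim),
            nn.ReLU(),  # visual CNN/transformer features are often non-negative
        )

    def forward(self, class_embed, noise):
        # Condition the generator by concatenating embedding and noise
        return self.net(torch.cat([class_embed, noise], dim=1))

# At test time, synthesize features for an unseen class from its
# semantic embedding; a classifier trained on such features enables
# zero-shot recognition.
gen = FeatureGenerator()
unseen_embed = torch.randn(8, 300)  # hypothetical semantic vectors
noise = torch.randn(8, 64)
fake_feats = gen(unseen_embed, noise)
print(fake_feats.shape)  # one 512-d synthetic feature per sample
```

In a full pipeline this generator would be trained adversarially against a discriminator on seen-class features before being used as a synthesizer for unseen classes.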