Gesture is an important means of non-verbal communication: the visual modality allows humans to convey information during interaction, facilitating both person-to-person and human-machine interaction. However, recognising gestures automatically remains a difficult problem. In this work, we explore three deep-learning approaches to hand-sign recognition applied to 3D moving-skeleton data: supervised learning, self-supervised learning, and visualisation-based techniques. Supervised learning is used to train fully connected, CNN, and LSTM models. A self-supervised reconstruction task is then applied to unlabelled data in a simulated setting, with a CNN as the backbone, and the learnt features are used for prediction on the remaining labelled data. Lastly, Grad-CAM is applied to discover what the models focus on. Our experimental results show that supervised learning can recognise gestures accurately, with self-supervised learning further increasing accuracy in the simulated setting. Finally, Grad-CAM visualisations show that the models indeed focus on the skeleton joints relevant to the associated gesture.
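The pretrain-then-transfer idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: a linear autoencoder stands in for the CNN backbone, random arrays stand in for skeleton frames, and all shapes and hyperparameters are assumptions chosen for brevity. The point is only the two-stage structure: learn an encoder by minimising reconstruction error on unlabelled data, then reuse that encoder as a feature extractor for the labelled subset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unlabelled data: 200 skeleton frames, 20 joints x 3 coords.
X_unlab = rng.normal(size=(200, 60))
d_in, d_hid = 60, 16

# Linear autoencoder weights (a toy stand-in for the CNN backbone).
W_enc = rng.normal(scale=0.1, size=(d_in, d_hid))
W_dec = rng.normal(scale=0.1, size=(d_hid, d_in))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

loss_init = mse(X_unlab @ W_enc @ W_dec, X_unlab)

# Self-supervised stage: minimise reconstruction error by gradient descent.
lr = 1e-3
for _ in range(300):
    H = X_unlab @ W_enc            # encode
    err = H @ W_dec - X_unlab      # reconstruction residual
    g_dec = H.T @ err / len(X_unlab)
    g_enc = X_unlab.T @ (err @ W_dec.T) / len(X_unlab)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

loss_final = mse(X_unlab @ W_enc @ W_dec, X_unlab)

# Transfer stage: the learnt encoder now maps labelled frames to features,
# on top of which a supervised head (classifier) would be trained.
X_lab = rng.normal(size=(10, 60))
features = X_lab @ W_enc
```

Reconstruction loss decreases over the pretraining loop, and `features` has the reduced dimensionality a downstream classifier would consume.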