In natural language processing (NLP) of spoken languages, word embeddings have proven to be a useful way to encode the meaning of words. Sign languages are visual languages, and therefore require sign embeddings that capture both the visual and the linguistic semantics of signs. Unlike many common approaches to sign recognition, we focus on explicitly creating sign embeddings that bridge the gap between sign language and spoken language. We propose a weakly supervised contrastive learning framework that derives LCC (Learnt Contrastive Concept) embeddings for sign language. We train a vocabulary of embeddings based on the linguistic labels of sign videos. Additionally, we develop a conceptual similarity loss that uses word embeddings from NLP methods to create sign embeddings with better sign language to spoken language correspondence. These learnt representations allow the model to automatically localise the sign in time. Our approach achieves state-of-the-art keypoint-based sign recognition performance on the WLASL and BOBSL datasets.
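The abstract does not state the form of the conceptual similarity loss. As a rough illustration of the general idea only, the sketch below assumes one plausible formulation: the pairwise cosine similarities between sign embeddings are pushed towards the pairwise cosine similarities between the word embeddings of their labels. The function names, the mean-squared-error form, and the random data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def conceptual_similarity_loss(sign_emb, word_emb):
    """Hypothetical loss: mean squared gap between the two similarity matrices.

    Row i of sign_emb and row i of word_emb are assumed to describe the
    same concept. The two embedding spaces may have different dimensions,
    because only the (n x n) similarity matrices are compared.
    """
    sign_sim = cosine_similarity_matrix(sign_emb)
    word_sim = cosine_similarity_matrix(word_emb)
    return float(np.mean((sign_sim - word_sim) ** 2))


# Toy data standing in for real embeddings (assumption: 5 concepts).
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(5, 8))    # e.g. word vectors for the sign labels
sign_emb = rng.normal(size=(5, 16))   # e.g. learnt sign embeddings

print(conceptual_similarity_loss(sign_emb, word_emb))
```

Under this assumed formulation the loss is zero whenever the sign embeddings reproduce the similarity structure of the word embeddings, regardless of scale or embedding dimension, which is one way such a term could encourage sign language to spoken language correspondence.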