Previous audio-visual speech separation methods use the synchronization between a speaker's facial movements and speech in video to supervise speech separation in a self-supervised way. In this paper, we propose a model that solves the speech separation problem with the assistance of both face and sign language, which we call the extended speech separation problem. We design a general deep learning network that learns to combine three modalities (audio, face, and sign language) to better solve the speech separation problem. To train the model, we introduce a large-scale dataset, the Chinese Sign Language News Speech (CSLNSpeech) dataset, in which the three modalities of audio, face, and sign language coexist. Experimental results show that the proposed model achieves better performance and robustness than the usual audio-visual system. In addition, the sign language modality can be used alone to supervise speech separation, and the introduction of sign language helps hearing-impaired people learn and communicate. Finally, our model is a general speech separation framework and achieves very competitive separation performance on two open-source audio-visual datasets. The code is available at https://github.com/iveveive/SLNSpeech