AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that multi-modal self-supervised embeddings yield useful audio-visual speech representations. Nevertheless, it is unclear whether such representations generalize to real-world multi-modal audio-visual (AV) regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we use the pre-trained AV-HuBERT model followed by a speech enhancement (SE) module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model outperforms both state-of-the-art AVSE models and traditional audio-only SE models. In summary, our results confirm the effectiveness of the proposed model for the AVSS task when proper fine-tuning strategies are used, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT generalize to audio-visual regression tasks.