This paper addresses a common failure of audio-visual speech enhancement (AVSE) systems: the enhanced output contains the wrong speech, often because of poor video quality or a mismatch between training and test data. We introduce a post-processing classifier (PPC) that corrects these erroneous outputs so that the enhanced speech corresponds to the intended speaker. We also adopt a mixup strategy in PPC training to improve its robustness. Experimental results on the AVSE-challenge dataset show that integrating PPC into the AVSE model significantly improves performance, and that combining PPC with an AVSE model trained with permutation invariant training (PIT) yields the best performance. The proposed method outperforms the baseline model by a large margin. The approach could also extend to other modalities and architectures, which suggests a promising direction for future research.
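The abstract does not say how mixup is applied when training the PPC. As a rough illustration only, the sketch below shows the standard mixup formulation, in which two training examples and their labels are blended with a weight drawn from a Beta distribution. The function name `mixup_pair`, the value `alpha=0.4`, and the use of plain lists for features and one-hot labels are assumptions made for this example, not details taken from the paper.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two (features, one-hot label) examples, standard mixup style.

    Hypothetical helper: alpha and the list-based inputs are assumptions
    for illustration, not the paper's configuration.
    """
    # Mixing weight lam ~ Beta(alpha, alpha), so 0 <= lam <= 1.
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the two feature vectors.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    # Same combination applied to the labels, giving a soft target.
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


if __name__ == "__main__":
    features, target, lam = mixup_pair(
        [1.0, 0.0], [1.0, 0.0],   # example from class 0
        [0.0, 1.0], [0.0, 1.0],   # example from class 1
        rng=random.Random(0),
    )
    print(lam, features, target)
```

Because the mixed label is a convex combination of two one-hot vectors, it still sums to one, so a classifier trained on such pairs sees soft targets between the two classes rather than only hard decisions.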