Target speech extraction aims to recover a target speech signal corrupted by interfering sources, such as noise or competing speakers, using a given conditioning cue. Building on the state-of-the-art (SOTA) time-frequency speaker separation model TF-GridNet, we propose AV-GridNet, a visually grounded variant that uses the face recording of the target speaker as the conditioning cue during extraction. Because speech and noise differ inherently as interfering sources, we also propose SAV-GridNet, a scenario-aware model that first identifies the type of interfering scenario and then applies a dedicated expert model trained specifically for that scenario. Our proposed model achieves SOTA results on the second COG-MHEAR Audio-Visual Speech Enhancement Challenge, outperforming other models by a significant margin in both objective metrics and a listening test. We also present an extensive analysis of the results under the two scenarios.
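The classify-then-route idea behind the scenario-aware model can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the authors' implementation: `scenario_aware_extract`, `toy_classifier`, `toy_experts`, the scenario labels `"speech"` and `"noise"`, and the assumption that the classifier sees both the mixture and the visual cue. The toy classifier and experts are placeholders for trained networks and do no real separation.

```python
from typing import Callable, Dict, List, Tuple

Signal = List[float]
Classifier = Callable[[Signal, Signal], str]   # (mixture, visual cue) -> scenario label
Expert = Callable[[Signal, Signal], Signal]    # (mixture, visual cue) -> target estimate


def scenario_aware_extract(
    mixture: Signal,
    visual_cue: Signal,
    classify: Classifier,
    experts: Dict[str, Expert],
) -> Tuple[str, Signal]:
    """Identify the interfering scenario, then run the matching expert."""
    scenario = classify(mixture, visual_cue)
    if scenario not in experts:
        raise KeyError(f"no expert registered for scenario {scenario!r}")
    return scenario, experts[scenario](mixture, visual_cue)


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_classifier(mixture: Signal, visual_cue: Signal) -> str:
    # Pretend that a loud peak indicates a competing speaker.
    return "speech" if max(abs(x) for x in mixture) > 0.5 else "noise"


toy_experts: Dict[str, Expert] = {
    "speech": lambda mix, cue: [0.5 * x for x in mix],  # placeholder "speech expert"
    "noise": lambda mix, cue: [0.9 * x for x in mix],   # placeholder "noise expert"
}

scenario, estimate = scenario_aware_extract(
    [0.2, -0.8, 0.4], [0.0], toy_classifier, toy_experts
)
```

In this sketch the routing decision is hard (exactly one expert runs per input), which is one plausible reading of "identifies the type of interfering scenario first and then applies a dedicated expert model"; the abstract does not say whether the actual system uses hard or soft routing.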