Humans can listen to a target speaker even in challenging acoustic conditions with noise, reverberation, and interfering speakers. This ability is known as the cocktail-party effect. For decades, researchers have worked toward approaching this human listening ability. One critical issue is handling interfering speakers: the target and non-target speech signals share similar characteristics, which complicates their discrimination. Target speech/speaker extraction (TSE) isolates the speech signal of a target speaker from a mixture of several speakers, with or without noise and reverberation, using clues that identify the speaker in the mixture. Such clues might be a spatial clue indicating the direction of the target speaker, a video of the speaker's lips, or a pre-recorded enrollment utterance from which the speaker's voice characteristics can be derived. TSE is an emerging field of research that has received increased attention in recent years because it offers a practical approach to the cocktail-party problem and draws on several areas of signal processing, including audio, visual, and array processing, as well as deep learning. This paper focuses on recent neural-based approaches and presents an in-depth overview of TSE. We guide readers through the major approaches, emphasizing the similarities among frameworks and discussing potential future directions.