Extracting the speech of participants in a conversation amidst interfering speakers and noise is a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants. To accomplish this, we propose leveraging temporal patterns inherent in human conversations, particularly turn-taking dynamics, which uniquely characterize speakers engaged in conversation and distinguish them from interfering speakers and noise. Using neural networks, we show the feasibility of our approach on English and Mandarin conversation datasets. In the presence of interfering speakers, our results show an 8.19 dB improvement in signal-to-noise ratio for 2-speaker conversations and a 7.92 dB improvement for conversations with 2-4 speakers. Code and dataset are available at https://github.com/chentuochao/Target-Conversation-Extraction.
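As a point of reference for the reported dB gains, the sketch below shows one common way such extraction improvements are measured: scale-invariant SNR improvement, i.e., the SI-SNR of the extracted audio minus that of the unprocessed mixture, both scored against the clean reference. This is an illustrative assumption, not the paper's evaluation code; the exact metric used in the paper may differ, and the function names here are hypothetical.

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-noise ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def si_snr_improvement(extracted: np.ndarray,
                       mixture: np.ndarray,
                       reference: np.ndarray) -> float:
    """Improvement in dB: score of the model output minus score of the raw mixture."""
    return si_snr(extracted, reference) - si_snr(mixture, reference)
```

Under this convention, an improvement of 8.19 dB means the extracted conversation audio scores 8.19 dB higher against the clean reference than the original noisy mixture does.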