Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talker mixture, while speaker diarization demarcates speech segments by speaker, identifying "who spoke when". Previous studies have typically treated the two tasks independently. However, the two tasks share a similar objective: both disentangle the speakers, the former in the spectral domain and the latter in the temporal domain. It is therefore reasonable to expect that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech yields more accurate speaker turns than the mixture speech does. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend the existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines on both the speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios. Audio samples are available at https://ajyy.github.io/demo/USED/.