Common target speech separation methods directly estimate the target source, ignoring the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts each speaker's voice from the mixed speech rather than optimally estimating only the target source. Moreover, we propose a speaker diarization (SD) aware MTSS system (SD-MTSS), which consists of an SD module and an MTSS module. By exploiting the TSVAD decision and the estimated mask, our SD-MTSS model can extract the speech signal of each speaker concurrently from a conversational recording, without requiring enrollment audio in advance. Experimental results show that our MTSS model achieves improvements of 1.38 dB in SDR, 1.34 dB in SI-SDR, and 0.13 in PESQ over the baseline on the WSJ0-2mix-extr dataset. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate (CER) on the Alimeeting dataset.
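For readers unfamiliar with the reported metrics, the following is a minimal sketch of how SI-SDR and a relative error-rate reduction are commonly computed. It follows the standard definition of scale-invariant SDR and is not taken from the authors' evaluation code; the function names `si_sdr` and `relative_reduction`, the `eps` stabilizer, and the mean-removal step are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length.

    Illustrative helper (hypothetical name), following the usual definition.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything not explained by the scaled reference counts as distortion.
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)


def relative_reduction(baseline_rate, system_rate):
    """Relative reduction in percent, e.g. for a character error rate."""
    return 100.0 * (baseline_rate - system_rate) / baseline_rate
```

An improvement such as the one reported in the abstract would be the difference between `si_sdr` of the proposed system's output and `si_sdr` of the baseline's output, averaged over the test set. Because the estimate is projected onto the reference, rescaling the estimate leaves the score unchanged, which is what distinguishes SI-SDR from plain SDR.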