Speech separation remains an important topic for researchers working on multi-speaker technology. Convolution-augmented transformers (conformers) have performed well on many speech processing tasks but remain under-researched for speech separation. Most recent state-of-the-art (SOTA) separation models have been time-domain audio separation networks (TasNets). A number of successful models use dual-path (DP) networks, which process local and global information sequentially. Time-domain conformers (TD-Conformers) are an analogue of the DP approach in that they also process local and global context sequentially, but they have a different time complexity function. It is shown that, for realistic shorter signal lengths, conformers are more efficient when controlling for feature dimension. Subsampling layers are proposed to further improve computational efficiency. The best TD-Conformer achieves 14.6 dB and 21.2 dB scale-invariant signal-to-distortion ratio (SISDR) improvement on the WHAMR and WSJ0-2Mix benchmarks, respectively.
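For readers unfamiliar with the reported metric, the following is a minimal sketch of how SISDR improvement is commonly computed, using the standard definition of scale-invariant SDR. It is not taken from the paper's evaluation code: the function names `si_sdr` and `si_sdr_improvement`, the `eps` stabiliser, and the mean-removal step are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals.

    Illustrative helper (hypothetical name), following the common definition:
    project the estimate onto the reference, then compare the energy of that
    projection with the energy of the residual.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean of each signal (an assumed but common preprocessing step).
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)

    # Target component and residual (distortion) energies.
    target = [scale * r for r in ref]
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(est, target))

    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SISDR of the separated estimate minus SISDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Under this definition, a figure such as 14.6 dB means the separated output scores that much higher against the clean reference than the unprocessed mixture does, typically averaged over speakers and test utterances.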