Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet to be applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task, and compression tends to introduce distortions that severely degrade performance. Here we introduce a novel task, Audio Codec-based SS, in which SS is performed within the embedding space of a NAC, and propose a new model, Codecformer, to address it. At inference, Codecformer achieves a 52x reduction in MACs while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios.