Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet to be applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task where compression tends to introduce distortions that severely impact performance. Here we propose a novel task of Audio Codec-based SS, where SS is performed within the embedding space of a NAC, and propose a new model, Codecformer, to address this task. At inference, Codecformer achieves a 52x reduction in multiply-accumulate operations (MACs) while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios.