Recent advances in Neural Audio Codec (NAC) models have inspired their use in various speech processing tasks, including speech enhancement (SE). In this work, we propose a novel, efficient SE approach that leverages the pre-quantization output of a pretrained NAC encoder. Unlike prior NAC-based SE methods, which process discrete speech tokens with Language Models (LMs), we perform SE within the continuous embedding space of the pretrained NAC, which is highly compressed along the time dimension for efficient representation. Our lightweight SE model, optimized with an embedding-level loss, delivers results comparable to SE baselines trained on larger datasets, at a significantly lower real-time factor of 0.005. Additionally, our method requires only 3.94 GMACs, an 18-fold reduction in complexity compared to Sepformer in a simulated cloud-based audio transmission environment. This work highlights a new, efficient NAC-based SE solution, particularly suited to cloud applications where NAC compresses audio before transmission.

Copyright 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
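The pipeline the abstract describes — encode audio to a time-compressed continuous embedding, then enhance directly in that embedding space rather than on discrete tokens — can be illustrated with a minimal toy sketch. Everything below (the frame size, embedding dimension, the frozen random-projection "encoder", and the untrained residual-MLP "enhancer") is an illustrative assumption for shape bookkeeping only, not the paper's actual architecture or a real pretrained codec.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(wav, hop=320, dim=128):
    """Frame the waveform and project each frame to a continuous embedding.
    Stand-in for the pre-quantization output of a pretrained NAC encoder:
    the time axis is compressed by a factor of `hop`."""
    n_frames = len(wav) // hop
    frames = wav[: n_frames * hop].reshape(n_frames, hop)
    W = rng.standard_normal((hop, dim)) / np.sqrt(hop)  # frozen random projection
    return frames @ W  # (n_frames, dim)

def enhance(emb, hidden=64):
    """Lightweight SE operating on the continuous embeddings.
    Stand-in for the embedding-level enhancement model: an untrained
    two-layer MLP with a residual connection (hypothetical, for shapes only)."""
    d = emb.shape[1]
    W1 = rng.standard_normal((d, hidden)) / np.sqrt(d)
    W2 = rng.standard_normal((hidden, d)) / np.sqrt(hidden)
    return emb + np.maximum(emb @ W1, 0.0) @ W2  # ReLU MLP + residual

sr = 16000
noisy = rng.standard_normal(sr)      # 1 s of stand-in "noisy" audio
emb = encode(noisy)                  # time-compressed continuous representation
enhanced_emb = enhance(emb)          # SE applied in the embedding space
print(emb.shape)                     # prints (50, 128): 50 frames vs 16000 samples
```

In a deployment like the one the abstract targets, the NAC encoder already runs on the client to compress audio before transmission, so the enhancement model only adds the cost of the small embedding-space network; the enhanced embeddings would then pass through the codec's quantizer and decoder on the receiving end. An embedding-level training loss would compare `enhance(encode(noisy))` against `encode(clean)` for paired noisy/clean utterances.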