This work investigates modelling strategies in continuous and discrete latent spaces for speech enhancement (SE) based on vector quantisation (VQ) neural audio codecs (NACs), along with the role of VQ regularisation. We propose two frameworks, cNAC-SE and dNAC-SE, which predict continuous representations and discrete tokens in the latent space, respectively. Theoretical analysis and latent-space visualisations reveal their inherent modelling mechanisms. Experimental results show that the fully fine-tuned cNAC-SE model consistently outperforms all dNAC-SE variants across diverse test conditions and achieves leading performance among established generative approaches on DNS-MOS metrics. Comparison with the discriminative counterpart shows that VQ enhances robustness through an intrinsic, clean-prior-constrained regularisation effect that is independent of discrete token processing. This highlights the transferable value of VQ regularisation to other continuous modelling methods.
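To make the distinction between the two prediction targets concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single codebook with nearest-neighbour lookup (real codecs typically stack several quantisers), a mean-squared-error objective for the continuous target, and a cross-entropy objective for the discrete target; the abstract does not state which losses are used. All function names, shapes, and the random data are hypothetical.

```python
import numpy as np


def quantise(z, codebook):
    """Nearest-neighbour VQ (assumed single codebook).

    z: (T, D) latent frames, codebook: (K, D) code vectors.
    Returns the quantised frames (T, D) and their token indices (T,).
    """
    # Squared Euclidean distance between every frame and every code vector: (T, K)
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx


def continuous_loss(pred_latent, clean_latent):
    """Continuous-target objective (assumed MSE): regress the clean latent."""
    return float(np.mean((pred_latent - clean_latent) ** 2))


def discrete_loss(logits, clean_idx):
    """Discrete-target objective (assumed cross-entropy): classify the clean token per frame."""
    # Numerically stable log-softmax over the K codebook entries
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(clean_idx)), clean_idx].mean())


# Toy data: 8 code vectors of dimension 4, and 5 "clean" latent frames
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
clean_latent = rng.normal(size=(5, 4))
quantised, clean_idx = quantise(clean_latent, codebook)

# Stand-ins for model outputs: a perturbed latent and random token logits
pred_latent = clean_latent + 0.1 * rng.normal(size=clean_latent.shape)
logits = rng.normal(size=(5, 8))

print("continuous loss:", continuous_loss(pred_latent, clean_latent))
print("discrete loss:", discrete_loss(logits, clean_idx))
```

In this sketch the continuous branch can land anywhere in the latent space, whereas the discrete branch can only select one of the K codebook entries. Passing a continuous prediction through `quantise` snaps it onto codes learnt from clean speech, which is one way to picture the clean-prior-constrained regularisation the abstract attributes to VQ.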