Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but they are not always optimized for noisy or real-world environments. Building on existing work that quantizes Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens while extracting interpretable noise vectors as quantization residue, which is supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, produces speech tokens with a high degree of noise-invariance, and improves ASR performance. With Whisper kept frozen, we achieve an 82% reduction in error rate compared to Whisper and a 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.
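To make the core idea concrete, the following is a minimal, stdlib-only sketch of the quantization-with-residue step the abstract describes: an embedding is snapped to its nearest codebook entry (the clean-speech token), and the leftover vector is kept as the residue that would carry noise information. The function name, the toy codebook, and all values here are illustrative assumptions, not the paper's actual model or code; the real system operates on Whisper encoder embeddings and supervises the residue with a lightweight noise classifier.

```python
# Hypothetical illustration of codebook quantization with a noise residue.
# None of these names or values come from the paper.

def quantize_with_residue(embedding, codebook):
    """Return (token_id, residue) for the nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Nearest-neighbor lookup: the token index stands in for clean content.
    token_id = min(range(len(codebook)),
                   key=lambda i: sq_dist(embedding, codebook[i]))
    # Quantization residue: what the codebook could not explain.
    residue = [x - c for x, c in zip(embedding, codebook[token_id])]
    return token_id, residue

# Toy 2-entry codebook in a 3-dimensional latent space (made-up values).
codebook = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
noisy_embedding = [0.9, 0.1, 0.2]  # near code 0, perturbed by "noise"

token, residue = quantize_with_residue(noisy_embedding, codebook)
```

In the paper's framing, `token` would be the discrete speech unit passed downstream, while `residue` is the interpretable noise vector that a small classifier is trained on, encouraging the codebook tokens themselves to be noise-invariant.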