Noise-robust automatic speech recognition (ASR) is commonly addressed by applying speech enhancement (SE) at the waveform level before recognition. However, speech-level enhancement does not always translate into consistent recognition improvements, owing to residual distortions and mismatches with the latent space of the ASR encoder. In this letter, we introduce a complementary strategy termed latent-level enhancement, in which distorted representations are refined during ASR inference. Specifically, we propose a plug-and-play Flow Matching Refinement module (FM-Refiner) that operates on the output latents of a pretrained CTC-based ASR encoder. The FM-Refiner is trained to map imperfect latents (obtained either directly from noisy inputs or from enhanced-but-imperfect speech) toward their clean counterparts, and it is applied only at inference, without fine-tuning any ASR parameters. Experiments show that FM-Refiner consistently reduces word error rate, both when applied directly to noisy inputs and when combined with conventional SE front-ends. These results demonstrate that latent-level refinement via flow matching provides a lightweight and effective complement to existing SE approaches for robust ASR.
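The abstract does not state the exact flow matching formulation, so the following is only a minimal sketch of the general idea, assuming a straight-line (linear interpolation) probability path between an imperfect latent and its clean counterpart, and a fixed-step Euler solver at inference. The names `fm_training_pair`, `refine`, and `velocity_fn` are hypothetical and do not come from the paper; in practice `velocity_fn` would be a trained network, and the frozen ASR encoder and CTC head would sit before and after the refinement step.

```python
import numpy as np

def fm_training_pair(z_src, z_clean, t):
    """Build one flow matching training example (assumed linear path).

    z_src   : imperfect latent (from noisy or enhanced speech), shape (T, D)
    z_clean : latent of the matching clean utterance, same shape
    t       : scalar time in [0, 1]
    Returns the interpolated point z_t and the regression target for the
    velocity network, which is constant along a straight path.
    """
    z_t = (1.0 - t) * z_src + t * z_clean
    v_target = z_clean - z_src
    return z_t, v_target

def refine(z_src, velocity_fn, num_steps=8):
    """Refine a latent by Euler-integrating dz/dt = velocity_fn(z, t) on [0, 1].

    velocity_fn stands in for the learned velocity model (an assumption here);
    the ASR encoder that produced z_src is left untouched.
    """
    z = np.array(z_src, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity_fn(z, k * dt)
    return z
```

With a perfectly learned velocity field on a straight path, the Euler steps carry `z_src` onto `z_clean`; a real model only approximates this, and the number of solver steps trades refinement quality against inference cost.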