Acoustic foundation models fine-tuned for Automatic Speech Recognition (ASR) suffer performance degradation when deployed in real-world, wild acoustic test settings. Stabilizing online Test-Time Adaptation (TTA) under such conditions remains an open problem. Existing wild vision TTA methods often fail on speech data because high-entropy speech frames, which may still carry crucial semantic content, are unreliably filtered out. Moreover, unlike static vision data, speech signals exhibit short-term consistency, calling for specialized adaptation strategies. In this work, we propose a novel wild acoustic TTA method tailored to ASR-fine-tuned acoustic foundation models. Our method, Confidence-Enhanced Adaptation, performs frame-level adaptation with a confidence-aware weighting scheme that avoids discarding the essential information in high-entropy frames. In addition, we apply consistency regularization during test-time optimization to exploit the inherent short-term consistency of speech signals. Experiments on both synthetic and real-world datasets demonstrate that our approach outperforms existing baselines under various wild acoustic test settings, including Gaussian noise, environmental sounds, accent variations, and sung speech.
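The two ideas in the abstract can be illustrated with a minimal NumPy sketch: a frame-level entropy objective that softly down-weights (rather than discards) high-entropy frames via per-frame confidence, plus a consistency term over adjacent frames. The function names, the peak-posterior confidence measure, the weight normalization, and the squared-difference consistency penalty are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_weighted_entropy(logits):
    """Frame-level entropy loss weighted by per-frame confidence.

    Instead of hard-filtering high-entropy frames (as vision TTA methods
    often do), each frame's entropy is softly down-weighted by its
    confidence, so uncertain but informative frames still contribute.
    Confidence here is the peak posterior per frame (an assumption).
    """
    probs = softmax(logits)                                  # (T, C) frame posteriors
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)  # (T,) per-frame entropy
    confidence = probs.max(axis=-1)                          # (T,) peak posterior
    weights = confidence / confidence.sum()                  # normalized soft weights
    return float((weights * entropy).sum())

def short_term_consistency(logits):
    """Penalize divergence between posteriors of adjacent frames,
    reflecting the short-term consistency of speech signals."""
    probs = softmax(logits)
    return float(np.mean((probs[1:] - probs[:-1]) ** 2))

def tta_objective(logits, lam=0.1):
    # Combined test-time objective; lam balances the two terms (assumed).
    return confidence_weighted_entropy(logits) + lam * short_term_consistency(logits)
```

In an actual TTA loop this objective would be minimized over a small set of model parameters (e.g. normalization layers) for each incoming utterance; the sketch only shows the loss computation on per-frame logits of shape `(T, C)`.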