Speech foundation models suffer significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy for addressing such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, limiting their applicability to speech tasks and resource-constrained settings. Although backpropagation-free methods offer improved efficiency, existing ones exhibit poor accuracy. This is because they are predominantly developed for vision tasks, which differ fundamentally from speech in task formulation, noise characteristics, and model architecture, posing unique transferability challenges. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS balances adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for forward-pass-based feature alignment, (ii) a multi-scale loss that captures both global (utterance-level) and local (token-level) distribution shifts, and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with 4.1%-13.5% accuracy gains over backpropagation-free baselines and 2.0-6.4 times GPU memory savings compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for more efficient adaptation approaches in practical speech processing systems deployed in real-world environments.
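To make the abstract's components (ii) and (iii) concrete, the following is a minimal NumPy sketch of what a multi-scale (utterance-level plus token-level) alignment loss and a test-time EMA update over prompt parameters could look like. All function names, the use of precomputed source-domain feature statistics, and the decay value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def multiscale_alignment_loss(feats, src_mean, src_var):
    """Hypothetical multi-scale distribution-shift loss.

    feats:    (T, D) token-level features for one utterance.
    src_mean: (D,)   assumed precomputed source-domain feature mean.
    src_var:  (D,)   assumed precomputed source-domain feature variance.
    """
    # Global term: distance between the utterance-level mean and the source mean.
    utt_mean = feats.mean(axis=0)
    global_term = np.mean((utt_mean - src_mean) ** 2)
    # Local term: per-token deviation normalized by source variance.
    local_term = np.mean((feats - src_mean) ** 2 / (src_var + 1e-6))
    return global_term + local_term

def ema_update(prompt, new_prompt, decay=0.99):
    """Test-time exponential moving average of prompt parameters,
    smoothing updates across utterances for stable adaptation."""
    return decay * prompt + (1.0 - decay) * new_prompt
```

In a backpropagation-free setting, such a loss would be evaluated with forward passes only (e.g., to score candidate prompts), and the winning prompt would be folded into the running state via `ema_update` before the next utterance arrives.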