E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models

Speech foundation models suffer significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy for addressing such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, which limits their applicability to speech tasks and resource-constrained settings. Although backpropagation-free methods are more efficient, existing ones exhibit poor accuracy. This is because they were developed predominantly for vision tasks, which differ fundamentally from speech tasks in task formulation, noise characteristics, and model architecture, posing unique transferability challenges. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS balances adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for forward-pass-based feature alignment, (ii) a multi-scale loss that captures both global (utterance-level) and local (token-level) distribution shifts, and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with accuracy gains of 4.1% to 13.5% over backpropagation-free baselines and GPU memory savings of 2.0 to 6.4 times compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for more efficient adaptation approaches for practical speech processing systems in real-world environments.
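As an illustration of component (iii), the following is a minimal sketch of a generic exponential moving average over per-utterance prompt vectors. It is not the E-BATS implementation: the function names `ema_update` and `smooth_over_utterances`, the `decay` value of 0.9, and the representation of a prompt as a flat list of floats are all assumptions made here for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence


def ema_update(
    running: Sequence[float],
    current: Sequence[float],
    decay: float = 0.9,  # assumed value; the abstract does not state one
) -> List[float]:
    """Blend the running prompt with the prompt adapted on the current utterance."""
    if len(running) != len(current):
        raise ValueError("prompt vectors must have the same length")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    # Standard EMA: keep `decay` of the history, take (1 - decay) of the new value.
    return [decay * r + (1.0 - decay) * c for r, c in zip(running, current)]


def smooth_over_utterances(
    per_utterance_prompts: Sequence[Sequence[float]],
    decay: float = 0.9,
) -> List[List[float]]:
    """Return the EMA-smoothed prompt after each utterance in a stream."""
    if not per_utterance_prompts:
        return []
    # Hypothetical initialisation: start from the first utterance's prompt.
    running = list(per_utterance_prompts[0])
    history = [running]
    for prompt in per_utterance_prompts[1:]:
        running = ema_update(running, prompt, decay)
        history.append(running)
    return history
```

A larger `decay` gives more weight to earlier utterances and therefore a smoother trajectory, at the cost of slower reaction to a change in acoustic conditions.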
