Automatic speech recognition (ASR) models frequently encounter data distribution shifts in real-world scenarios, which lead to erroneous predictions. To tackle this issue, a test-time adaptation (TTA) method was recently proposed that adapts a pre-trained ASR model to unlabeled test instances without access to source data. Despite a decent performance gain, that method relies solely on naive greedy decoding and performs adaptation across timesteps at the frame level, which may be suboptimal given the sequential nature of the model output. Motivated by this, we propose SGEM, a novel TTA framework for general ASR models. To handle the sequential output, SGEM first uses beam search to explore candidate output logits and selects the most plausible candidate. It then adapts the model using generalized entropy minimization and negative sampling as unsupervised objectives. SGEM achieves state-of-the-art performance for three mainstream ASR models under various domain shifts.
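To make the entropy-minimization objective more concrete, the following is a minimal sketch of one possible generalized entropy, assuming "generalized entropy" refers to the Rényi entropy of order alpha (the abstract does not specify the form). The function names `softmax`, `renyi_entropy`, and `sequence_entropy` are hypothetical and are not taken from the SGEM implementation. Beam search and negative sampling are not shown.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha (assumed form of 'generalized entropy').

    H_alpha(p) = log(sum_i p_i ** alpha) / (1 - alpha), for alpha > 0, alpha != 1.
    At alpha == 1 it is defined by its limit, the Shannon entropy.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-8:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def sequence_entropy(frame_logits, alpha):
    """Average per-step entropy over the logits of one candidate sequence.

    In a TTA setting this scalar would be the quantity to minimize; here it is
    only computed, with no gradient step (assumption: a simple mean over steps).
    """
    entropies = [renyi_entropy(softmax(step), alpha) for step in frame_logits]
    return sum(entropies) / len(entropies)
```

Under these assumptions, a confident (peaked) prediction yields a lower value than a uniform one, so minimizing the objective pushes the model toward sharper output distributions on the selected candidate.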