Test-Time Adaptation (TTA) is a critical paradigm for tackling distribution shifts during inference, especially in visual recognition tasks. Acoustic models face similar challenges from distribution shifts in test-time speech, yet TTA techniques designed specifically for acoustic modeling under open-world data shifts remain scarce. This gap is exacerbated by two characteristics of acoustic foundation models: 1) they are primarily built on transformer architectures with layer normalization, and 2) they process test-time speech data of varying lengths in a non-stationary manner. These characteristics make it infeasible to directly apply vision-focused TTA methods, which mostly rely on batch normalization and assume independent samples. In this paper, we study TTA for pre-trained acoustic models facing open-world data shifts. We find that noisy, high-entropy speech frames, which are often non-silent, carry key semantic content. Traditional TTA methods may inadvertently filter out this information through potentially flawed heuristics. In response, we introduce a heuristic-free, learning-based adaptation method enriched by confidence enhancement. Noting the short-term consistency of speech signals, we also apply consistency regularization during test-time optimization. Experiments on synthetic and real-world datasets show that our method outperforms existing baselines.
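To make the two ingredients of the test-time objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes frame-level class logits are available and, because the abstract does not give the exact form of the confidence enhancement, it swaps in plain entropy minimization over all frames (no frame is filtered out). The consistency term penalizes differences between the posteriors of adjacent frames. The function names and the `consistency_weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_entropy(probs):
    """Shannon entropy (in nats) of one frame's posterior."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def tta_objective(frame_logits, consistency_weight=0.3):
    """Illustrative test-time loss for a single utterance.

    frame_logits: list of per-frame logit lists (any utterance length >= 1).
    consistency_weight: hypothetical trade-off weight, not from the paper.
    """
    probs = [softmax(f) for f in frame_logits]

    # Entropy term averaged over every frame, including high-entropy ones.
    entropy = sum(frame_entropy(p) for p in probs) / len(probs)

    # Short-term consistency: mean squared difference between the
    # posteriors of neighbouring frames.
    if len(probs) > 1:
        diffs = [
            sum((a - b) ** 2 for a, b in zip(p, q))
            for p, q in zip(probs[:-1], probs[1:])
        ]
        consistency = sum(diffs) / len(diffs)
    else:
        consistency = 0.0

    return entropy + consistency_weight * consistency
```

In an actual adaptation loop, a quantity like this would be computed with an automatic differentiation framework and minimized with respect to a small set of model parameters, for example the layer normalization parameters; that choice is also an assumption here.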