The practical utility of Speech Emotion Recognition (SER) systems is undermined by their fragility under domain shift, which arises from speaker variability, the gap between acted and naturalistic emotions, and cross-corpus differences. While domain adaptation and fine-tuning are widely studied, they require either source data or labeled target data, which are often unavailable or raise privacy concerns in SER. Test-time adaptation (TTA) bridges this gap by adapting models at inference time using only unlabeled target data. However, existing TTA methods were designed predominantly for image classification and speech recognition, and their efficacy in mitigating the domain shifts specific to SER has not been investigated. In this paper, we present the first systematic evaluation and comparison of 11 TTA methods across three representative SER tasks. The results indicate that backpropagation-free TTA methods are the most promising. Conversely, entropy minimization and pseudo-labeling generally fail, as their core assumption of a single, confident ground-truth label is incompatible with the inherent ambiguity of emotional expression. Further, no single method excels universally; the effectiveness of each method depends strongly on the type of distributional shift and on the task.