Speech emotion recognition (SER) with audio-language models (ALMs) remains vulnerable to distribution shifts at test time, leading to performance degradation in out-of-domain scenarios. Test-time adaptation (TTA) provides a promising solution but often relies on gradient-based updates or prompt tuning, limiting flexibility and practicality. We propose Emo-TTA, a lightweight, training-free adaptation framework that incrementally updates class-conditional statistics via an Expectation-Maximization procedure for explicit test-time distribution estimation, using ALM predictions as priors. Emo-TTA operates on individual test samples without modifying model weights. Experiments on six out-of-domain SER benchmarks show consistent accuracy improvements over prior TTA baselines, demonstrating the effectiveness of statistical adaptation in aligning model predictions with evolving test distributions.
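The incremental update described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: class-conditional distributions are modeled as diagonal Gaussians over a feature vector, the ALM's zero-shot class probabilities are passed in as `prior`, and the names `OnlineClassStats`, `step`, `pseudo_count`, and `var_floor` are hypothetical. Each call performs one E-step (posterior responsibilities for a single sample) and one incremental M-step (running means and variances), with no gradient computation and no change to model weights.

```python
import numpy as np


class OnlineClassStats:
    """Running per-class Gaussian statistics, updated one test sample at a time.

    Illustrative sketch only: diagonal covariances and the initialisation
    below are assumptions, not details taken from the paper.
    """

    def __init__(self, num_classes, dim, pseudo_count=1.0, var_floor=1e-3):
        # Assumed start: zero means, unit variances, one pseudo-count per class.
        self.counts = np.full(num_classes, pseudo_count, dtype=float)
        self.means = np.zeros((num_classes, dim))
        self.vars = np.ones((num_classes, dim))
        self.var_floor = var_floor

    def step(self, x, prior):
        """Adapt on one sample and return its posterior over classes."""
        x = np.asarray(x, dtype=float)
        prior = np.asarray(prior, dtype=float)

        # E-step: posterior is proportional to prior * N(x | mean_k, diag(var_k)).
        # The constant term of the Gaussian cancels across classes and is omitted.
        log_lik = -0.5 * np.sum(
            (x - self.means) ** 2 / self.vars + np.log(self.vars), axis=1
        )
        log_post = np.log(prior + 1e-12) + log_lik
        log_post -= log_post.max()  # stabilise before exponentiating
        resp = np.exp(log_post)
        resp /= resp.sum()

        # M-step: weighted running update of soft counts, means, and variances.
        self.counts += resp
        lr = (resp / self.counts)[:, None]
        delta = x - self.means
        self.means += lr * delta
        self.vars += lr * (delta * (x - self.means) - self.vars)
        np.maximum(self.vars, self.var_floor, out=self.vars)
        return resp


# Toy stream: two well-separated clusters with moderately confident priors.
stats = OnlineClassStats(num_classes=2, dim=2)
for _ in range(20):
    stats.step([3.0, 3.0], prior=[0.9, 0.1])
    stats.step([-3.0, -3.0], prior=[0.1, 0.9])

# After adaptation, an uninformative prior is resolved by the estimated statistics.
posterior = stats.step([3.0, 3.0], prior=[0.5, 0.5])
print(posterior.argmax(), posterior)
```

In this toy stream the prior dominates the first few predictions, and the accumulated statistics take over as evidence grows. The pseudo-count and variance floor are simple safeguards against degenerate early estimates; a real system would likely need to tune or replace them.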