We revisit the INTERSPEECH 2009 Emotion Challenge -- the first ever speech emotion recognition (SER) challenge -- and evaluate a series of deep learning models that are representative of the major advances in SER research since then. We first train each model with a fixed set of hyperparameters, and then further tune the best-performing models of that initial setup with a grid search. Results are always reported on the official test set, with a separate validation set used only for early stopping. Most models score below or close to the official baseline, and only marginally outperform the original challenge winners after hyperparameter tuning. Our work illustrates that, despite recent progress, FAU-AIBO remains a very challenging benchmark. An interesting corollary is that newer methods do not consistently outperform older ones, showing that progress towards `solving' SER is not necessarily monotonic.