Speech Emotion Recognition (SER) requires substantial computational resources to overcome the challenge of considerable annotator disagreement. Today, SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics such as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to the non-converging consensus of annotator opinions. However, the Concordance Correlation Coefficient (CCC) has arisen as an alternative metric for A/D/V, where a model's output is evaluated against a whole dataset's CCC rather than the L2 distances of individual audios. Recent studies have shown that Wav2Vec2.0 / WavLM architectures outputting a float value for each A/D/V dimension achieve today's state-of-the-art (SotA) CCC on A/D/V. The Wav2Vec2.0 / WavLM family has a high computational footprint, but training small models using human annotations has been unsuccessful. In this paper we use a large Transformer SotA A/D/V model as a Teacher/Annotator to train five student models (four MobileNets and our proposed Wav2Small), using only the Teacher's A/D/V outputs instead of human annotations. The Teacher model we propose also sets a new SotA on the MSP Podcast dataset, with a valence CCC of 0.676. We choose MobileNetV4 / MobileNetV3 as students, as MobileNet has been designed for fast execution times. We also propose Wav2Small, an architecture designed for minimal parameters and RAM consumption. With a quantised .onnx file of only 120 KB, Wav2Small is a potential solution for A/D/V on hardware with low resources, having only 72K parameters versus 3.12M parameters for MobileNet-V4-Small.
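As background for the evaluation metric discussed above, the CCC between predictions and labels can be computed over a whole dataset as 2·cov(x, y) / (σx² + σy² + (μx − μy)²). A minimal sketch (the `ccc` helper name and the toy values are ours, not from the paper):

```python
import numpy as np

def ccc(pred, true):
    """Concordance Correlation Coefficient computed over a whole
    set of predictions, not per-audio distances."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    mean_p, mean_t = pred.mean(), true.mean()
    var_p, var_t = pred.var(), true.var()
    # Population covariance between predictions and labels.
    cov = ((pred - mean_p) * (true - mean_t)).mean()
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)

# Perfect agreement yields CCC = 1; a constant offset lowers it.
print(ccc([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))  # 1.0
print(ccc([0.1, 0.5, 0.9], [0.2, 0.6, 1.0]))  # < 1.0
```

Unlike the L2 distance, this score penalises both scale and mean-shift disagreement while being computed on the dataset level, which is why it tolerates per-sample annotator noise better.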