Speech Emotion Recognition (SER) requires substantial computational resources to overcome the challenge of considerable annotator disagreement. Today, SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics such as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to the non-converging consensus of annotator opinions. Instead, the Concordance Correlation Coefficient (CCC) has emerged as an alternative metric for A/D/V, where a model's output is evaluated against the CCC of a whole dataset rather than the L2 distances of individual audio files. Recent studies have shown that Wav2Vec2.0 / WavLM architectures outputting a float value for each A/D/V dimension achieve today's state-of-the-art (SOTA) CCC on A/D/V. The Wav2Vec2.0 / WavLM family has a high computational footprint, but training small models using human annotations has been unsuccessful. In this paper, we use a large Transformer SOTA A/D/V model as a Teacher/Annotator to train five student models: four MobileNets and our proposed Wav2Small, using only the Teacher's A/D/V predictions instead of human annotations. The Teacher model sets a new SOTA on the MSP Podcast dataset with a valence CCC of 0.676. We choose MobileNetV4 / MobileNetV3 as students, as MobileNet has been designed for fast execution times. We also propose Wav2Small, an architecture designed for a minimal parameter count and RAM consumption. With an 8-bit quantized .onnx file of only 120 KB, Wav2Small is a potential solution for A/D/V on low-resource hardware, having only 72K parameters versus 3.12M parameters for MobileNet-V4-Small.
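For context, the CCC used for evaluation above can be computed as in the sketch below. This is an illustrative NumPy implementation of the standard CCC formula using population statistics, not the paper's evaluation code; the function name `ccc` is our own.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance Correlation Coefficient between two 1-D arrays.

    CCC = 2*cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)

    Unlike Pearson correlation, CCC also penalizes shifts in mean and
    differences in scale, so it rewards agreement, not just linear association.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()          # population variance (ddof=0)
    cov = ((pred - mean_p) * (gold - mean_g)).mean()  # population covariance
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)
```

Note that CCC is a dataset-level score: it is computed once over all predictions and labels, rather than averaged over per-utterance errors, which is why it tolerates annotator disagreement better than per-sample L2 distance.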