This paper introduces a novel convolutional neural network (CNN) framework tailored to end-to-end audio deep learning models, with advances in both efficiency and explainability. In benchmark experiments on three standard speech emotion recognition datasets with five-fold cross-validation, our framework outperforms Mel spectrogram features by up to seven percent. It can potentially replace Mel-Frequency Cepstral Coefficients (MFCC) while remaining lightweight. Furthermore, we demonstrate the efficiency and interpretability of the front-end layer on the PhysioNet Heart Sound Database, illustrating its ability to capture intricate patterns in long waveforms. Our contributions offer a portable solution for building efficient and interpretable models on raw waveform data.
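To illustrate the general idea of a learned front end operating on raw waveforms (as opposed to precomputed Mel/MFCC features), the following is a minimal NumPy sketch of a bank of 1-D convolutional filters applied directly to a waveform. The function name, filter count, kernel length, and stride are illustrative assumptions, not the architecture proposed in this paper.

```python
import numpy as np

def conv1d_frontend(waveform, kernels, stride):
    """Apply a bank of 1-D kernels to a raw waveform.

    waveform: (n_samples,) raw audio
    kernels:  (n_kernels, kernel_len) filter bank (learnable in practice)
    returns:  (n_frames, n_kernels) feature map
    """
    kernel_len = kernels.shape[1]
    n_frames = 1 + (len(waveform) - kernel_len) // stride
    # Slice overlapping frames, then correlate each with every kernel.
    frames = np.stack([waveform[i * stride : i * stride + kernel_len]
                       for i in range(n_frames)])
    return frames @ kernels.T

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)          # 1 s of audio at 16 kHz (synthetic)
kernels = rng.standard_normal((40, 400))   # 40 filters of 25 ms each (illustrative)
feats = conv1d_frontend(wave, kernels, stride=160)  # 10 ms hop
print(feats.shape)  # (98, 40)
```

With random kernels this reduces to framing plus projection; in an end-to-end model the kernels are trained jointly with the downstream CNN, which is what allows such a front end to compete with, and potentially replace, fixed Mel or MFCC features.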