Over the past few years, audio classification on large-scale datasets such as AudioSet has been an important research area. Several deep convolutional neural networks have shown compelling performance, notably VGGish, YAMNet, and the Pretrained Audio Neural Network (PANN). These models are available as pretrained architectures for transfer learning and for adaptation to specific audio tasks. In this paper, we propose LEAN, a lightweight on-device deep learning model for audio classification. LEAN consists of a raw-waveform-based temporal feature extractor, called the Wave Encoder, and a log-mel-based pretrained YAMNet. We show that combining the trainable Wave Encoder and the pretrained YAMNet with cross-attention-based temporal realignment yields competitive performance on downstream audio classification tasks with a smaller memory footprint, making the model suitable for resource-constrained devices such as mobile and edge devices. Our proposed system achieves an on-device mean average precision (mAP) of 0.445 with a memory footprint of only 4.5 MB on the FSD50K dataset, an improvement of 22% over the baseline on-device mAP on the same dataset.
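To illustrate the idea of cross-attention-based temporal realignment between two feature streams with different frame rates, here is a minimal sketch. It is not the LEAN implementation: the direction of attention (waveform features as queries, YAMNet embeddings as keys and values), the single head without learned projections, the concatenation-based fusion, and all shapes and names (`wave_feats`, `yamnet_feats`, `cross_attention`) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross attention.

    queries:      (T_q, d) features from one stream
    keys_values:  (T_kv, d) features from the other stream
    Returns the second stream resampled onto the T_q time steps of the
    first stream, plus the (T_q, T_kv) attention weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (T_q, T_kv)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ keys_values, weights


# Hypothetical shapes: 50 waveform-encoder frames and 10 YAMNet frames,
# both already projected to a shared 64-dimensional space.
rng = np.random.default_rng(0)
wave_feats = rng.standard_normal((50, 64))
yamnet_feats = rng.standard_normal((10, 64))

aligned, weights = cross_attention(wave_feats, yamnet_feats)

# One possible fusion (an assumption): concatenate along the feature axis.
fused = np.concatenate([wave_feats, aligned], axis=-1)
```

In this sketch, each waveform frame receives a weighted average of the YAMNet frames, so both streams share one time axis before fusion. A trained model would typically add learned query, key, and value projections.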