Comparison of semi-supervised deep learning algorithms for audio classification

From arXiv: 9 pages, 5 figures, 5 tables. This is version 3 of the paper. It contains minor fixes compared to the EURASIP version (version 2 of the paper).

In this article, we adapted five recent semi-supervised learning (SSL) methods to the task of audio classification. The first two methods, namely Deep Co-Training (DCT) and Mean Teacher (MT), involve two collaborative neural networks. The three other algorithms, called MixMatch (MM), ReMixMatch (RMM), and FixMatch (FM), are single-model methods that rely primarily on data augmentation strategies. Using the Wide-ResNet-28-2 architecture in all our experiments, with 10% of the training data labeled and the remaining 90% used as unlabeled data, we first compare the error rates of the five methods on three standard benchmark audio datasets: Environmental Sound Classification (ESC-10), UrbanSound8K (UBS8K), and Google Speech Commands (GSC). In all but one case, MM, RMM, and FM outperformed MT and DCT significantly, with MM and RMM being the best methods in most experiments. On UBS8K and GSC, MM achieved error rates (ER) of 18.02% and 3.25%, respectively, outperforming models trained with 100% of the available labeled data, which reached 23.29% and 4.94%, respectively. RMM achieved the best result on ESC-10 (12.00% ER), followed by FM, which reached 13.33%. Second, we explored adding the mixup augmentation, used in MM and RMM, to DCT, MT, and FM. In almost all cases, mixup brought consistent gains. For instance, on GSC, FM reached 4.44% ER without mixup and 3.31% ER with mixup. Our PyTorch code will be made available upon paper acceptance at https://github.com/Labbeti/SSLH.
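
To illustrate the mixup augmentation mentioned above, here is a minimal sketch in plain Python; it is not the authors' PyTorch implementation. It blends two examples and their label vectors with a weight drawn from a Beta distribution. The function name `mixup`, the list-based inputs, and the value `alpha=0.75` are assumptions chosen for illustration. The `max(lam, 1 - lam)` step follows the variant described in the MixMatch paper, which keeps the mixed sample closer to the first example; whether each method in this article uses that step is not stated in the abstract.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.75, rng=random):
    """Blend two examples (features and one-hot or soft labels).

    alpha is the Beta distribution parameter (illustrative value).
    rng is any object exposing betavariate(), e.g. random.Random(seed).
    """
    # Sample the mixing weight from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # MixMatch-style asymmetry: keep the result closer to the first example.
    lam = max(lam, 1.0 - lam)
    # Convex combination of features and of labels with the same weight.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Usage: mix two toy 2-dimensional examples from different classes.
rng = random.Random(0)
x, y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], rng=rng)
```

Because the same weight is applied to features and labels, the mixed label remains a valid probability vector whenever both inputs are.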
