Augmentation and knowledge distillation (KD) are well-established techniques in audio classification, used to improve performance and reduce model size on the widely used Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, has not been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits together with the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature: only the stored logits are used to optimize a student model, which requires just 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M-parameter model that achieves a mean average precision (mAP) of 49.0 on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
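The idea of storing logits alongside augmentations can be illustrated with a minimal sketch. It is not the code from the CED repository: the augmentation (a seeded circular shift plus gain), the linear stand-in teacher, the float16 storage, and the sigmoid-based binary cross-entropy loss are all assumptions chosen for illustration, and every function name below is hypothetical. The point it shows is that replaying a stored seed gives the student the same augmented view the teacher saw, so the stored logits can serve as the only training target.

```python
import numpy as np


def augment(wave, seed):
    """Hypothetical augmentation: circular shift and gain, fully determined by `seed`."""
    rng = np.random.default_rng(seed)
    shift = int(rng.integers(0, wave.shape[0]))
    gain = rng.uniform(0.5, 1.5)
    return gain * np.roll(wave, shift)


def teacher_logits(wave, weights):
    """Stand-in teacher (assumption): a fixed linear map from waveform to class logits."""
    return weights @ wave


def build_store(dataset, weights, seeds):
    """Run the teacher once per (sample, seed) and keep only the seed and the logits."""
    store = []
    for index, wave in enumerate(dataset):
        for seed in seeds:
            logits = teacher_logits(augment(wave, seed), weights)
            # float16 storage is an assumption made here to keep the store small.
            store.append({"index": index, "seed": seed, "logits": logits.astype(np.float16)})
    return store


def distillation_loss(student_logits, stored_logits):
    """Binary cross-entropy against teacher probabilities; no ground-truth labels are used."""
    z = np.asarray(student_logits, dtype=np.float64)
    target = 1.0 / (1.0 + np.exp(-stored_logits.astype(np.float64)))
    # Numerically stable form of BCE with logits.
    loss = np.maximum(z, 0.0) - z * target + np.log1p(np.exp(-np.abs(z)))
    return float(loss.mean())


# Toy data: three random "waveforms" of 16 samples and a 4-class teacher.
rng = np.random.default_rng(0)
dataset = [rng.standard_normal(16) for _ in range(3)]
weights = rng.standard_normal((4, 16))

store = build_store(dataset, weights, seeds=[0, 1])

# Student side: replay the stored seed to obtain the view the teacher saw.
entry = store[0]
view = augment(dataset[entry["index"]], entry["seed"])
student_out = teacher_logits(view, weights)  # an idealized student, for illustration only
loss = distillation_loss(student_out, entry["logits"])
```

In this sketch a student that reproduces the teacher's outputs on the replayed view attains a lower loss than one that disagrees with it, and the store holds one small logit vector and one integer seed per augmented view rather than the augmented audio itself.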