Augmentation and knowledge distillation (KD) are well-established techniques in audio classification, used to improve performance and reduce model size on the widely used Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, has not been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores the teacher logits together with the augmentation methods on disk, which makes it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature: only the stored logits are used to optimize a student model, and storing them requires only 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M-parameter model that achieves a mean average precision (mAP) of 49.0 on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
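The idea of storing logits alongside the augmentation can be illustrated with a minimal sketch. It is not the CED implementation: the toy augmentation, the stand-in teacher, the JSON-lines file layout, and all function names (`augment`, `toy_teacher`, `store_teacher_outputs`, `distillation_loss`) are assumptions made for illustration. The sketch assumes that an augmentation can be replayed from a stored seed, so that the student sees the same augmented input the teacher saw, and that the student is trained with a binary cross-entropy against the teacher's soft outputs, without ground-truth labels.

```python
import json
import math
import os
import random
import tempfile


def augment(wave, seed):
    """Toy augmentation (random gain + circular shift), fully determined by `seed`."""
    rng = random.Random(seed)
    gain = rng.uniform(0.5, 1.5)
    shift = rng.randrange(len(wave))
    return [gain * v for v in wave[shift:] + wave[:shift]]


def toy_teacher(wave, num_classes=3):
    """Hypothetical stand-in for a teacher ensemble: one logit per class."""
    return [sum(v * (i + 1) for v in wave[i::num_classes]) for i in range(num_classes)]


def store_teacher_outputs(dataset, path, seed_base=0):
    """Run the teacher once on augmented inputs; save logits and the augmentation seed."""
    with open(path, "w") as f:
        for idx, wave in enumerate(dataset):
            seed = seed_base + idx
            logits = toy_teacher(augment(wave, seed))
            f.write(json.dumps({"idx": idx, "seed": seed, "logits": logits}) + "\n")


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def distillation_loss(student_logits, teacher_logits):
    """Mean binary cross-entropy against soft teacher targets (no labels involved)."""
    total = 0.0
    for z, t_logit in zip(student_logits, teacher_logits):
        t = sigmoid(t_logit)
        # Stable form of BCE-with-logits: max(z, 0) - z * t + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(student_logits)


# Two toy "waveforms" standing in for audio clips.
dataset = [
    [0.1, -0.2, 0.3, 0.4, -0.5, 0.6],
    [0.0, 0.5, -0.5, 0.25, -0.25, 0.1],
]
path = os.path.join(tempfile.mkdtemp(), "teacher_logits.jsonl")
store_teacher_outputs(dataset, path)

with open(path) as f:
    records = [json.loads(line) for line in f]

losses = []
for rec in records:
    # Replay the stored augmentation so the student sees the teacher's view.
    student_input = augment(dataset[rec["idx"]], rec["seed"])
    # A real student would be a smaller trainable model; the toy teacher is
    # reused here only to keep the sketch self-contained.
    student_logits = toy_teacher(student_input)
    losses.append(distillation_loss(student_logits, rec["logits"]))
```

Because only a seed and a short logit vector are written per sample, the stored file stays small relative to the audio itself, which is the property the abstract attributes to CED at a much larger scale.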