We propose a pre-training pipeline that prepares audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a carefully designed training routine on AudioSet frame-level annotations. The routine includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain substantial performance improvements over previously available checkpoints, both on AudioSet frame-level predictions and on downstream frame-level sound event detection tasks, confirming the pipeline's effectiveness. We publish the resulting checkpoints, which researchers can fine-tune directly to build high-performance models for sound event detection.