Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, improving their efficiency without compromising accuracy remains a challenge. In this paper, we introduce FastAST, a framework that integrates Token Merging (ToMe) into AST. By merging similar tokens in audio spectrograms, FastAST speeds up inference without requiring extensive retraining, and it also significantly speeds up training. Our experiments indicate that FastAST increases audio classification throughput with minimal impact on accuracy. To mitigate this accuracy impact, we integrate Cross-Model Knowledge Distillation (CMKD) into FastAST. Combining ToMe and CMKD with AST yields higher accuracy than the baseline AST while retaining faster inference. FastAST represents a step towards real-time, resource-efficient audio analysis.
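To make the idea of merging similar tokens concrete, the following is a minimal sketch of one merging step in the spirit of ToMe's bipartite matching, written in plain Python. It is an illustration under stated assumptions and not the FastAST implementation: the function names `cosine` and `merge_tokens` are hypothetical, similarity is computed directly on the token vectors (ToMe itself is usually described as comparing attention keys inside each transformer block), and no class token is protected from merging.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def merge_tokens(tokens, r):
    """Merge the r most similar token pairs; returns len(tokens) - r tokens.

    Assumptions for illustration: similarity is measured on the tokens
    themselves, merged tokens are plain averages, and no class token
    is treated specially.
    """
    # Alternate tokens into two sets: A (even positions) and B (odd positions).
    set_a = tokens[0::2]
    set_b = tokens[1::2]
    r = max(0, min(r, len(set_a)))
    if r == 0 or not set_b:
        return [list(t) for t in tokens]

    # For each token in A, find its most similar partner in B.
    edges = []
    for i, a in enumerate(set_a):
        scores = [cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=scores.__getitem__)
        edges.append((scores[j], i, j))

    # Keep only the r strongest edges; those A tokens are merged away.
    edges.sort(key=lambda e: e[0], reverse=True)
    merged = {i: j for _, i, j in edges[:r]}

    # Accumulate sums and counts per destination in B, then average.
    dim = len(tokens[0])
    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for i, j in merged.items():
        for d in range(dim):
            sums[j][d] += set_a[i][d]
        counts[j] += 1
    new_b = [[x / c for x in s] for s, c in zip(sums, counts)]

    # Unmerged A tokens pass through unchanged, followed by the updated B set.
    kept_a = [list(a) for i, a in enumerate(set_a) if i not in merged]
    return kept_a + new_b
```

Calling `merge_tokens(patches, r)` on a list of spectrogram patch embeddings returns `r` fewer tokens, so applying such a step in every transformer block shrinks the sequence progressively. Note that this sketch does not preserve the original token order, which a real model with positional information may need to handle.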