The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (ASTs). Recent advances in ASTs have shown superior performance in various audio-based tasks. However, the performance of a standard AST degrades drastically when it is evaluated with a patch size different from the one used during training. As a result, AST models are typically retrained whenever the patch size changes. To overcome this limitation, this paper proposes FlexiAST, a training procedure that gives standard AST models this flexibility without architectural changes, allowing them to work with various patch sizes at the inference stage. The proposed training approach relies only on random patch-size selection and on resizing the patch and positional embedding weights. Our experiments show that FlexiAST performs similarly to standard AST models while remaining usable across various patch sizes at evaluation, on different datasets for audio classification tasks.
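The training approach described above (pick a patch size at random, then resize the patch and positional embedding weights to match) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which resizing method is used, so plain bilinear interpolation is assumed here, and the function names, the candidate patch sizes, the spectrogram shape, and the single-channel weight grids are all hypothetical simplifications.

```python
import random


def bilinear_resize(grid, new_h, new_w):
    """Resize a 2-D list of floats to (new_h, new_w) by bilinear interpolation.

    Assumption: bilinear interpolation stands in for whatever resizing
    method the paper actually uses.
    """
    old_h, old_w = len(grid), len(grid[0])
    out = []
    for i in range(new_h):
        # Map the output row to a source coordinate (end points aligned).
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def sample_training_config(spec_shape, base_kernel, base_pos, patch_sizes, rng):
    """Draw a patch size for one training step and resize the weights to fit.

    spec_shape  : (frequency_bins, time_frames) of the input spectrogram
    base_kernel : 2-D patch-embedding kernel at the base patch size
                  (one input/output channel only, for brevity)
    base_pos    : 2-D grid holding one dimension of the positional embedding
    """
    p = rng.choice(patch_sizes)                    # random patch-size selection
    grid_h, grid_w = spec_shape[0] // p, spec_shape[1] // p
    kernel = bilinear_resize(base_kernel, p, p)    # patch embedding -> p x p
    pos = bilinear_resize(base_pos, grid_h, grid_w)  # one position per patch
    return p, kernel, pos


if __name__ == "__main__":
    rng = random.Random(0)
    base_kernel = [[1.0] * 16 for _ in range(16)]  # hypothetical 16 x 16 kernel
    base_pos = [[0.0] * 64 for _ in range(8)]      # grid for a 128 x 1024 input
    p, kernel, pos = sample_training_config(
        (128, 1024), base_kernel, base_pos, [8, 16, 32], rng
    )
    print(p, len(kernel), len(pos), len(pos[0]))
```

In a real model the same resizing would be applied to every channel of the patch-embedding kernel and every dimension of the positional embedding. At inference the patch size would be fixed to the desired value rather than sampled.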