Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. However, Transformer-based models such as the Audio Spectrogram Transformer (AST) inherit the fixed-size input paradigm from CNNs, which degrades AST performance at inference when input lengths differ from those seen during training. This paper introduces an approach that enables AST models to handle variable-length audio inputs during both training and inference. By employing sequence packing, our method, ElasticAST, accommodates any audio length during training, thereby offering flexibility across all lengths and resolutions at inference. This flexibility allows ElasticAST to maintain evaluation capability at various lengths or resolutions, achieving performance similar to standard ASTs trained at specific lengths or resolutions. Moreover, experiments demonstrate that ElasticAST performs better when trained and evaluated on native-length audio datasets.
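The sequence-packing step mentioned above can be illustrated as first-fit-decreasing bin packing: variable-length spectrogram patch sequences are grouped into fixed-capacity batches so that no padding to a single global length is needed. This is only a minimal sketch under assumed conventions; `pack_sequences` and its parameters are illustrative, not the paper's implementation (which would additionally mask attention so packed sequences do not attend across boundaries).

```python
def pack_sequences(lengths, capacity):
    """Greedily pack variable-length sequences into bins of at most
    `capacity` tokens (first-fit decreasing).

    lengths  : per-sequence token counts, e.g. number of spectrogram patches
    capacity : maximum total tokens per packed batch
    returns  : list of bins, each a list of (sequence_index, length) pairs
    """
    bins = []
    # Visit sequences from longest to shortest for tighter packing.
    for idx, n in sorted(enumerate(lengths), key=lambda t: -t[1]):
        for b in bins:
            if sum(l for _, l in b) + n <= capacity:
                b.append((idx, n))  # fits into an existing bin
                break
        else:
            bins.append([(idx, n)])  # open a new bin
    return bins


# Example: four clips whose patch counts vary, packed into 128-token bins.
bins = pack_sequences([60, 100, 40, 30], capacity=128)
# e.g. [[(1, 100)], [(0, 60), (2, 40)], [(3, 30)]]
```

Each bin is then processed as one Transformer input, with a block-diagonal attention mask keeping the packed clips independent.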