Large pre-trained models have demonstrated dominant performance in many areas, where consistency between pre-training and fine-tuning is key to their success. However, few works have reported satisfactory results from pre-trained models on the machine anomalous sound detection (ASD) task. This may stem from a mismatch between the pre-trained model and the inductive bias of machine audio, i.e., an inconsistency in both data and architecture. We therefore propose AnoPatch, which takes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. We believe machine audio is more closely related to general audio datasets than to speech datasets, and that patch-level modeling suits its sparsity. As a result, AnoPatch achieves state-of-the-art (SOTA) performance on both the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically show that better consistency yields considerable improvement.
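To make the patch-level modeling concrete, the following is a minimal sketch of how a ViT-style backbone tokenizes a spectrogram into patches before fine-tuning. The 16x16 patch size and the 128-mel input shape are assumptions in the ViT/AST tradition, not AnoPatch's published configuration.

```python
def spectrogram_to_patches(spec, patch_size=16):
    """Split a 2-D spectrogram (freq_bins x time_frames) into
    non-overlapping patch_size x patch_size patches, each flattened
    into one token vector, as a ViT patch embedding expects.

    Patch size 16 is an assumed ViT-style default, not AnoPatch's
    exact setting."""
    n_freq, n_time = len(spec), len(spec[0])
    patches = []
    for f0 in range(0, n_freq - patch_size + 1, patch_size):
        for t0 in range(0, n_time - patch_size + 1, patch_size):
            # Flatten one frequency-time patch into a single token.
            token = [spec[f][t]
                     for f in range(f0, f0 + patch_size)
                     for t in range(t0, t0 + patch_size)]
            patches.append(token)
    return patches

# A hypothetical 128-mel x 64-frame clip yields (128/16) * (64/16) = 32 tokens,
# each of dimension 16 * 16 = 256.
spec = [[0.0] * 64 for _ in range(128)]
tokens = spectrogram_to_patches(spec)
assert len(tokens) == 32 and len(tokens[0]) == 256
```

Because most patches of a stationary machine sound carry little information, attending over such tokens lets the transformer focus on the few informative regions, which is the sparsity argument made above.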