Bioacoustics, the study of animal sounds, offers a non-invasive method to monitor ecosystems. Extracting embeddings from audio-pretrained deep learning (DL) models without fine-tuning has become a popular way to obtain bioacoustic features for downstream tasks. However, a recent benchmark study reveals that while fine-tuned audio-pretrained VGG and transformer models achieve state-of-the-art performance on some tasks, they fail on others. This study benchmarks 11 DL models on the same tasks by reducing the dimensionality of their learned embeddings and evaluating them through clustering. We found that audio-pretrained DL models 1) even underperform a fine-tuned AlexNet when used without fine-tuning, 2) fail to separate background sounds from labeled sounds both with and without fine-tuning, whereas ResNet succeeds, and 3) outperform other models when fewer background sounds are included during fine-tuning. This study underscores the necessity of fine-tuning audio-pretrained models and of inspecting the embeddings after fine-tuning. Our code is available at https://github.com/NeuroscienceAI/Audio_Embeddings
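The evaluation protocol described above (reduce the learned embeddings' dimensionality, cluster them, then score the clusters) can be illustrated with a minimal sketch. This sketch assumes embeddings and labels are precomputed NumPy arrays; PCA, k-means, ARI, and silhouette are illustrative stand-ins, not necessarily the exact methods used in the paper (see the linked repository for the actual pipeline).

```python
# Minimal sketch of the reduce -> cluster -> evaluate protocol.
# `embeddings` and `labels` are placeholders standing in for features
# extracted from an audio-pretrained DL model and their class labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 1024))  # placeholder model embeddings
labels = rng.integers(0, 5, size=500)      # placeholder sound-class labels

# 1) Reduce the dimensionality of the learned embeddings.
reduced = PCA(n_components=2, random_state=0).fit_transform(embeddings)

# 2) Cluster the reduced embeddings.
pred = KMeans(n_clusters=len(np.unique(labels)), n_init=10,
              random_state=0).fit_predict(reduced)

# 3) Evaluate cluster quality against the labels and the geometry.
print("ARI:", adjusted_rand_score(labels, pred))
print("Silhouette:", silhouette_score(reduced, pred))
```

Comparing such scores across models is one way to reveal, for example, whether background sounds separate from labeled sounds in a given model's embedding space.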