animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Julian C. Schäfer-Zimmermann, Vlad Demartsev, Baptiste Averly, Kiran Dhanjal-Adams, Mathieu Duteil, Gabriella Gall, Marius Faiß, Lily Johnson-Ulrich, Dan Stowell, Marta B. Manser, Marie A. Roch, Ariana Strandburg-Peshkin

From arXiv. Code available at: https://github.com/livingingroups/animal2vec | Dataset available at: https://doi.org/10.17617/3.0J0DYB

Bioacoustic research provides invaluable insights into the behavior, ecology, and conservation of animals. Most bioacoustic datasets consist of long recordings in which events of interest, such as vocalizations, are exceedingly rare. Analyzing these datasets poses a monumental challenge to researchers, and deep learning techniques have emerged as a standard method for doing so. Adapting these techniques remains challenging, however: current approaches rely mostly on models conceived for computer vision, in which audio waveforms are converted into spectrogram representations for training and inference. We improve the current state of deep learning in bioacoustics in two ways. First, we present the animal2vec framework: a fully interpretable transformer model and self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. Second, we openly publish MeerKAT: Meerkat Kalahari Audio Transcripts, a large-scale dataset of audio collected via biologgers deployed on free-ranging meerkats. It totals over 1068 h, of which 184 h are labeled with twelve time-resolved vocalization-type classes at millisecond resolution, making it the largest publicly available labeled dataset on terrestrial mammals. Further, we benchmark animal2vec on the NIPS4Bplus birdsong dataset. We report new state-of-the-art results on both datasets and evaluate the few-shot capabilities of animal2vec when only limited labeled training data is available. Finally, we perform ablation studies to highlight the differences between our architecture and a vanilla transformer baseline for human-produced sounds. animal2vec allows researchers to classify massive amounts of sparse bioacoustic data even with little ground truth information available. In addition, the MeerKAT dataset is the first large-scale, millisecond-resolution corpus for benchmarking bioacoustic models in the pretrain/finetune paradigm. We believe this sets the stage for a new reference point for bioacoustics.
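
To make the notions of time-resolved labels and sparse events concrete, the following is a minimal sketch of how millisecond-resolution annotations could be turned into frame-wise multi-hot targets, and how the fraction of frames containing any event could be measured. It is not taken from the animal2vec code base: the class names, the 10 ms frame length, the example events, and the helper names `events_to_frames` and `active_fraction` are all illustrative assumptions.

```python
# Hypothetical sketch, not the animal2vec implementation.
# Class names, frame length, and events are illustrative assumptions,
# not values taken from the MeerKAT dataset.

CLASSES = ["close_call", "short_note", "alarm"]  # assumed label names
FRAME_MS = 10  # assumed: one target frame per 10 ms of audio


def events_to_frames(events, duration_ms, classes=CLASSES, frame_ms=FRAME_MS):
    """Convert (start_ms, end_ms, class_name) events to multi-hot frames.

    Returns a list with one 0/1 vector per frame. A frame is marked
    active for a class if it overlaps the event interval at all.
    """
    n_frames = -(-duration_ms // frame_ms)  # ceiling division on integers
    targets = [[0] * len(classes) for _ in range(n_frames)]
    for start_ms, end_ms, name in events:
        col = classes.index(name)
        first = max(0, start_ms // frame_ms)            # floor of start
        last = min(n_frames, -(-end_ms // frame_ms))    # ceiling of end
        for frame in range(first, last):
            targets[frame][col] = 1
    return targets


def active_fraction(targets):
    """Fraction of frames in which at least one class is active."""
    return sum(1 for row in targets if any(row)) / len(targets)


# Usage: a 10 s clip with three short, partly overlapping events.
events = [
    (1200, 1350, "close_call"),
    (1300, 1420, "short_note"),
    (7005, 7031, "alarm"),
]
targets = events_to_frames(events, duration_ms=10_000)
print(f"{active_fraction(targets):.1%} of frames contain an event")  # 2.6%
```

Integer millisecond timestamps keep the frame arithmetic exact, and multi-hot rows allow overlapping calls to be active in the same frame. Even in this toy clip only a small share of frames is non-empty, which is the kind of imbalance the abstract describes.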
