Bioacoustic research, vital for understanding animal behavior, conservation, and ecology, faces a monumental challenge: analyzing vast datasets where animal vocalizations are rare. While deep learning techniques are becoming standard, adapting them to bioacoustics remains difficult. We address this with animal2vec, an interpretable large transformer model, and a self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. It learns from unlabeled audio and then refines its understanding with labeled data. Furthermore, we introduce and publicly release MeerKAT: Meerkat Kalahari Audio Transcripts, a dataset of meerkat (Suricata suricatta) vocalizations with millisecond-resolution annotations, the largest labeled dataset on non-human terrestrial mammals currently available. Our model outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset. Moreover, animal2vec performs well even with limited labeled data (few-shot learning). animal2vec and MeerKAT provide a new reference point for bioacoustic research, enabling scientists to analyze large amounts of data even with scarce ground truth information.