Large language models (LLMs) prompted with text and audio represent the state of the art in various auditory tasks, including speech, music, and general audio, and show emergent abilities on unseen tasks. However, these capabilities have yet to be fully demonstrated in bioacoustics tasks such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior. These tasks are crucial for conservation, biodiversity monitoring, and the study of animal behavior. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our carefully curated training dataset comprises text-audio pairs spanning a diverse range of bioacoustics, speech, and music data, and is designed to address the challenges posed by limited annotated datasets in the field. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. Importantly, we evaluate NatureLM-audio on a novel benchmark (BEANS-Zero), where it sets a new state of the art (SotA) on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we also open-source the code for generating training and benchmark data, as well as for training the model.