Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.
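The two mechanisms named above, FiLM conditioning and layer-to-layer distillation, can be illustrated with a minimal pure-Python sketch. It is not BioME's implementation. The abstract does not specify how the modulation features are computed, which layers are paired, or which loss is used, so the function names, the tiny dimensions, the mean-squared-error objective, and the assumption that student and teacher layers share the same width are all hypothetical choices made for illustration.

```python
def linear(x, weight, bias):
    """Affine map; weight holds d_out rows of d_in values, x has d_in values."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift every hidden channel using (gamma, beta)
    predicted from a conditioning vector.

    hidden: list of frames, each a list of d_model channel values
    cond:   conditioning vector (assumed here to hold modulation features)
    """
    gamma = linear(cond, w_gamma, b_gamma)  # per-channel scale
    beta = linear(cond, w_beta, b_beta)     # per-channel shift
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)]
            for frame in hidden]


def layer_to_layer_mse(student_layers, teacher_layers):
    """Mean over paired layers of the MSE between student and teacher outputs.

    Assumes both models expose the same number of paired layers and the same
    feature width; a real setup may need a projection between them.
    """
    if len(student_layers) != len(teacher_layers):
        raise ValueError("student and teacher must expose the same number of paired layers")
    per_layer = []
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        sq = [(s - t) ** 2
              for s_frame, t_frame in zip(s_layer, t_layer)
              for s, t in zip(s_frame, t_frame)]
        per_layer.append(sum(sq) / len(sq))
    return sum(per_layer) / len(per_layer)


# Toy inputs: 2 frames x 2 channels, conditioned on a 3-dimensional vector.
hidden = [[1.0, 2.0], [3.0, 4.0]]
cond = [0.5, -1.0, 2.0]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # gamma = [cond[0], cond[1]]
b_gamma = [0.0, 0.0]
w_beta = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]   # beta = [cond[2], 1.0]
b_beta = [0.0, 1.0]

modulated = film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta)
teacher_hidden = [[1.0, 2.0], [3.0, 5.0]]
loss = layer_to_layer_mse([hidden], [teacher_hidden])
print(modulated, loss)
```

In this sketch the conditioning vector only rescales and shifts existing channels, so the extra cost is two small affine maps per conditioned layer, which is one reason FiLM-style conditioning is a plausible fit for low-capacity encoders.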