Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study

Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, which makes supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAEs) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impact of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. However, contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefit and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderately sized fine-grained bioacoustic settings, the scale of the pretraining data matters more than the design of the pretraining objective. These findings clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.
