CoarseSoundNet: Building a reliable model for ecological soundscape analysis

A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in soundscape ecology is how these components interact, specifically how biophony responds to geophony and anthropophony. However, few analytical tools currently allow these components to be quantified separately. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, which limits their generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible framework for building ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves when additional PAM data are used, especially data similar to the target domain, and when an explicit silence class is introduced during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses reveal challenges for anthropophony due to masking effects, as well as confusions involving silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study showing that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to those obtained with ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.
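The abstract does not say how the class-specific decision thresholds and duration-based constraints are implemented. The following is a minimal sketch of one way such post-processing could look, assuming the model outputs frame-wise multi-label probabilities of shape (frames, classes). The function names, the class order, and any threshold or duration values passed in are hypothetical and are not taken from the study.

```python
import numpy as np

# Assumed class order for the probability columns (illustrative only).
CLASSES = ("biophony", "geophony", "anthropophony")


def apply_thresholds(probs, thresholds):
    """Binarise probabilities with one threshold per class.

    probs: array-like of shape (n_frames, n_classes).
    thresholds: array-like of shape (n_classes,).
    """
    probs = np.asarray(probs, dtype=float)
    return probs >= np.asarray(thresholds, dtype=float)


def enforce_min_duration(active, min_frames):
    """Remove runs of consecutive active frames shorter than min_frames."""
    active = np.asarray(active, dtype=bool)
    out = active.copy()
    # Pad with False so that every run has a detectable start and end.
    padded = np.concatenate(([False], active, [False]))
    changes = np.diff(padded.astype(int))
    starts = np.flatnonzero(changes == 1)   # first frame of each run
    ends = np.flatnonzero(changes == -1)    # one past the last frame of each run
    for start, end in zip(starts, ends):
        if end - start < min_frames:
            out[start:end] = False
    return out


def postprocess(probs, thresholds, min_frames):
    """Apply per-class thresholds, then per-class minimum-duration filtering."""
    detections = apply_thresholds(probs, thresholds)
    for c in range(detections.shape[1]):
        detections[:, c] = enforce_min_duration(detections[:, c], min_frames[c])
    return detections
```

In this sketch, `postprocess(probs, thresholds=[0.5, 0.3, 0.6], min_frames=[2, 3, 2])` would return a boolean array with one column per class, in which detections that are too short for their class are suppressed. In practice, the thresholds and minimum durations would presumably be tuned per class on validation data.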
