MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds

Bioacoustic datasets from tropical regions remain limited, in part due to the absence of reproducible workflows for aggregating recordings from public archives. We present MyGardenBird, a curated dataset of bird vocalisations representing twelve common species across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality control checks. The primary release comprises 7,200 manually validated audio clips (16 kHz, 16-bit PCM mono WAV), balanced at 600 three-second clips per species (6.0 hours total) derived from 1,381 distinct recordings. Metadata includes geospatial coordinates, vocalisation categories, and signal-to-noise ratio (SNR) values (range: 0.83–59.18 dB; mean: 15.80 dB). A supplementary 44.1 kHz version is also provided. To mitigate data leakage, dataset partitions are defined at the source-recording level. Baseline classification experiments using convolutional neural networks on Mel-spectrograms achieved test accuracies of 92–96%, indicating strong interspecies separability. Limitations include reliance on single-annotator curation; however, validation with BirdNET confirmed label consistency. MyGardenBird is openly available at https://doi.org/10.5281/zenodo.20306877 under a CC BY-NC-SA 4.0 licence. Complete preprocessing code accompanies the release to support reproducibility and future expansion.
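The recording-level partitioning mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' released preprocessing code. The function name `split_by_recording`, the `(clip_id, recording_id)` input layout, and the 70/15/15 fractions are all assumptions chosen for illustration, since the abstract does not state the actual split ratios. The idea is to assign whole source recordings to a partition, so that clips cut from the same recording never appear in both training and test data.

```python
import random
from collections import defaultdict


def split_by_recording(clips, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign clips to train/val/test so that no source recording is shared.

    clips: iterable of (clip_id, recording_id) pairs (hypothetical layout).
    fractions: assumed train/val/test shares, applied to recordings, not clips.
    """
    # Group clip IDs under the recording they were cut from.
    by_recording = defaultdict(list)
    for clip_id, recording_id in clips:
        by_recording[recording_id].append(clip_id)

    # Shuffle the recordings (not the clips) with a fixed seed for repeatability.
    recording_ids = sorted(by_recording)
    random.Random(seed).shuffle(recording_ids)

    n = len(recording_ids)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    assignment = {
        "train": recording_ids[:n_train],
        "val": recording_ids[n_train:n_train + n_val],
        "test": recording_ids[n_train + n_val:],  # remainder goes to test
    }

    # Expand each partition back into its clip IDs.
    return {
        name: [clip for rec in recs for clip in by_recording[rec]]
        for name, recs in assignment.items()
    }


if __name__ == "__main__":
    # Toy data: 20 made-up recordings with 3 clips each.
    toy = [(f"clip_{r}_{k}", f"rec_{r}") for r in range(20) for k in range(3)]
    for name, ids in split_by_recording(toy).items():
        print(name, len(ids))
```

Because the split is made over recordings, the clip counts per partition only approximate the requested fractions when recordings contribute different numbers of clips. A per-species variant (running the same procedure within each species) would additionally keep the class balance, and scikit-learn's `GroupShuffleSplit` offers a comparable grouped split.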

