Bioacoustic datasets from tropical regions remain limited, in part because reproducible workflows for aggregating recordings from public archives are lacking. We present \textbf{MyGardenBird}, a curated dataset of bird vocalisations covering twelve species common across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality-control checks. The primary release comprises 7,200 manually validated audio clips (16 kHz, 16-bit PCM mono WAV), balanced at 600 three-second clips per species (6.0 hours total) and derived from 1,381 distinct recordings. Metadata include geospatial coordinates, vocalisation categories, and signal-to-noise ratio (SNR) values (range: 0.83--59.18 dB; mean: 15.80 dB). A supplementary 44.1 kHz version is also provided. To mitigate data leakage, dataset partitions are defined at the source-recording level, so that clips cut from the same recording never appear in more than one partition. Baseline classification experiments using convolutional neural networks on Mel-spectrograms achieved test accuracies of 92--96\%, indicating strong interspecies separability. The main limitation is reliance on a single annotator for curation, although validation with BirdNET confirmed label consistency. MyGardenBird is openly available at https://doi.org/10.5281/zenodo.20306877 under a CC BY-NC-SA 4.0 licence. Complete preprocessing code accompanies the release to support reproducibility and future expansion.
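As an illustration of partitioning at the source-recording level, the following minimal Python sketch assigns whole recordings, rather than individual clips, to partitions. It is not the released preprocessing code: the function name `split_by_recording`, the `(clip_id, recording_id)` input format, the 70/15/15 fractions, and the toy identifiers are all assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_recording(clips, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign clips to train/val/test so that no recording spans two partitions.

    clips: iterable of (clip_id, recording_id) pairs (hypothetical format).
    fractions: share of *recordings*, not clips, per partition (illustrative).
    """
    # Group clip identifiers under the recording they were cut from.
    by_recording = defaultdict(list)
    for clip_id, recording_id in clips:
        by_recording[recording_id].append(clip_id)

    # Sort before shuffling so the result depends only on the seed.
    recording_ids = sorted(by_recording)
    random.Random(seed).shuffle(recording_ids)

    n = len(recording_ids)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    groups = {
        "train": recording_ids[:n_train],
        "val": recording_ids[n_train:n_train + n_val],
        "test": recording_ids[n_train + n_val:],  # remainder goes to test
    }
    # Expand each group of recordings back into its clips.
    return {
        name: [clip for rec in ids for clip in by_recording[rec]]
        for name, ids in groups.items()
    }


# Toy input: 10 made-up recordings with 3 clips each.
toy_clips = [(f"rec{r}_clip{k}", f"rec{r}") for r in range(10) for k in range(3)]
partitions = split_by_recording(toy_clips)
```

Because the fractions apply to recordings, the clip counts per partition may deviate from the nominal shares when recordings contribute unequal numbers of clips. A species-stratified variant would apply the same grouping within each species.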