The automatic classification of animal sounds presents an enduring challenge in bioacoustics, owing to the diverse statistical properties of sound signals, variations in recording equipment, and prevalent low signal-to-noise ratio (SNR) conditions. Deep learning models such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have excelled in human speech recognition but have not been effectively tailored to the intricate nature of animal sounds, which exhibit substantial diversity even within the same domain. We propose an automated classification framework applicable to general animal sound classification. Our approach first optimizes audio features derived from Mel-frequency cepstral coefficients (MFCCs) through feature rearrangement and feature reduction. It then feeds the optimized features into a deep learning model, an attention-based Bidirectional LSTM (Bi-LSTM), to extract deep semantic features for sound classification. We also contribute an animal sound benchmark dataset encompassing oceanic animals and birds. Extensive experiments on real-world datasets demonstrate that our approach consistently outperforms baseline methods by over 25% in precision, recall, and accuracy, promising advancements in animal sound classification.
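The attention step over Bi-LSTM outputs can be sketched as attention pooling: each time frame's hidden state is scored, the scores are normalized with a softmax, and the weighted sum yields a single utterance-level embedding. Below is a minimal NumPy illustration with hypothetical dimensions and random stand-in values (the paper's actual Bi-LSTM hidden states and attention parameterization are not specified here, so `H` and `w` are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames of D-dimensional features
# (e.g., MFCC-derived vectors passed through a Bi-LSTM).
T, D = 100, 13
H = rng.standard_normal((T, D))   # stand-in for Bi-LSTM hidden states

# Attention pooling: score each frame, softmax, then weighted sum.
w = rng.standard_normal(D)        # hypothetical learned scoring vector
scores = H @ w                    # (T,) one relevance score per frame
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()              # attention weights, sum to 1
context = alpha @ H               # (D,) utterance-level embedding

print(alpha.sum())                # ~1.0
print(context.shape)              # (13,)
```

The resulting `context` vector would then feed a classification layer; the softmax weighting lets informative frames dominate, which is the usual motivation for attention in low-SNR bioacoustic settings.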