We propose an automatic data processing pipeline to extract vocal productions from large-scale natural audio recordings and to classify these vocal productions. The pipeline is based on a deep neural network and addresses both tasks simultaneously. Through a series of computational steps (windowing, creation of a noise class, data augmentation, re-sampling, transfer learning, Bayesian optimisation), it automatically trains a neural network without requiring a large sample of labeled data or substantial computing resources. Our end-to-end methodology can handle noisy recordings made under different recording conditions. We test it on two different natural audio data sets, one from a group of Guinea baboons recorded at a primate research center and one from human babies recorded at home. The pipeline trains a model on 72 and 77 minutes of labeled audio recordings, reaching accuracies of 94.58% and 99.76%, respectively. It is then used to process 443 and 174 hours of natural continuous recordings, from which it creates two new databases of 38.8 and 35.2 hours, respectively. We discuss the strengths and limitations of this approach, which can be applied to any massive audio recording.
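The windowing step listed among the computational steps can be illustrated with a minimal sketch. The abstract does not give the window length or the overlap, so the function name `split_into_windows` and the defaults `window_s` and `hop_s` are hypothetical choices made for illustration only, not the settings used in the pipeline.

```python
import numpy as np


def split_into_windows(signal, sample_rate, window_s=1.0, hop_s=0.5):
    """Cut a 1-D signal into fixed-length, possibly overlapping windows.

    window_s and hop_s are illustrative assumptions; the abstract does not
    state the values used by the pipeline.
    Returns an array of shape (n_windows, window_length_in_samples).
    """
    signal = np.asarray(signal)
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window_s and hop_s must give at least one sample")
    if signal.shape[0] < window:
        # Recording shorter than one window: no complete window available.
        return np.empty((0, window), dtype=signal.dtype)
    # Number of complete windows; a trailing partial window is dropped.
    n_windows = 1 + (signal.shape[0] - window) // hop
    starts = np.arange(n_windows) * hop
    # Row i holds the sample indices start_i, start_i + 1, ..., start_i + window - 1.
    index = starts[:, None] + np.arange(window)[None, :]
    return signal[index]


# Toy usage: 10 samples at 1 Hz, 4-second windows with a 2-second hop.
windows = split_into_windows(np.arange(10), sample_rate=1, window_s=4, hop_s=2)
```

Under these assumptions, each row of `windows` would be passed to the classifier as one fixed-length input, with a noise class absorbing the windows that contain no vocal production.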