The detection of shouted speech is crucial in audio surveillance and monitoring. Although a security system should be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, which makes it difficult to predict shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, so classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations, such as cheers. This paper therefore presents a new research resource, the RItsumeikan Shout Corpus (RISC), which contains a wide variety of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC is labeled with a shout type and is assigned shout intensity ratings obtained via a crowdsourcing service. We also present a comprehensive performance comparison of deep learning approaches on speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance regardless of the network architecture used. The results also demonstrate that shout type classification and intensity prediction remain challenging tasks, and RISC is expected to contribute to further development in this research area.