Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology, owing to a shortage of resources and processing methodology. To remedy this, this paper presents a processing framework that enables access to a large and unique body of high-quality whispered speech data. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD with Edyson, a bulk audio-annotation tool with a human in the loop. We also tackle a problem particular to ASMR: the separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and to train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in fields such as HCI, as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.