Non-verbal Vocalizations (NVs), such as laughter and sighs, are vital for conveying emotion and intention in human speech, yet most existing speech systems neglect them, which severely limits communicative richness and emotional intelligence. Existing methods for NV acquisition are either costly and unscalable (relying on manual annotation or recording) or unnatural (relying on rule-based synthesis). To address these limitations, we propose a highly scalable automatic annotation framework that labels non-verbal phenomena in natural speech; it is low-cost and easily extendable, and the resulting annotations are inherently diverse and natural. The framework leverages a unified detection model to accurately identify NVs in natural speech and integrates them with transcripts via a temporal-semantic alignment method. Using this framework, we created and released \textbf{NonVerbalSpeech-38K}, a diverse, real-world dataset of 38,718 samples spanning 10 NV categories collected from in-the-wild media. Experimental results demonstrate that our dataset provides superior controllability for NV generation and achieves comparable performance on NV understanding.
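To make the integration step concrete, the sketch below illustrates one way detected NV events could be interleaved with a word-level transcript by timestamp. It is a minimal, hypothetical illustration of the temporal side of alignment only, not the paper's actual method; the types `Word`, `NVEvent`, and the function `insert_nv_tags` are assumed names introduced purely for this example.

\begin{verbatim}
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float

@dataclass
class NVEvent:
    label: str    # e.g. "laughter", "sigh"
    start: float
    end: float

def insert_nv_tags(words: List[Word], events: List[NVEvent]) -> str:
    """Interleave detected NV events with a word-level transcript by time.

    Each NV is rendered as an inline tag (e.g. "[laughter]") placed before
    the first word whose start time follows the event's midpoint.
    (Illustrative sketch; the real framework also uses semantic cues.)
    """
    midpoint = lambda e: (e.start + e.end) / 2
    events = sorted(events, key=lambda e: e.start)
    tokens, i = [], 0
    for w in sorted(words, key=lambda w: w.start):
        while i < len(events) and midpoint(events[i]) <= w.start:
            tokens.append(f"[{events[i].label}]")
            i += 1
        tokens.append(w.text)
    tokens.extend(f"[{e.label}]" for e in events[i:])  # trailing events
    return " ".join(tokens)

# Example: a laugh detected between two words
words = [Word("that", 0.0, 0.3), Word("was", 0.35, 0.5), Word("great", 1.2, 1.6)]
events = [NVEvent("laughter", 0.55, 1.1)]
print(insert_nv_tags(words, events))  # -> "that was [laughter] great"
\end{verbatim}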