Nonverbal vocalizations (NVVs), such as laughing, sighing, and sobbing, are essential for human-like speech, yet standardized evaluations rarely assess in combination whether systems generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present NVV-SuperBench, a bilingual English/Chinese benchmark for speech generation with NVVs. It provides a unified 45-type taxonomy and a multi-axis protocol that goes beyond conventional speech quality assessment to evaluate NVV-specific controllability, placement, and perceptual salience. We benchmark 15 speech generation systems spanning prompt-based and tag-based control paradigms, using objective metrics, human listening tests, and LLM-based multi-rater evaluation. Results show that NVV controllability is often decoupled from speech quality, while low-SNR oral cues and long-duration affective NVVs remain bottlenecks. NVV-SuperBench highlights current gaps and supports progress toward more human-like speech generation.