Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.