Prosody plays a central role in sarcasm perception, yet previous studies have relied on naturally produced speech, which offers no fine-grained control over individual acoustic dimensions. Because prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework that uses neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. We constructed an orthogonal stimulus set, in which the three cues are varied independently of one another, to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, so listeners and model exhibit distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.
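As an illustration of what an orthogonal stimulus set means here, the following minimal sketch enumerates a full factorial crossing of the three cues. The two-level settings and their labels are assumptions made for the example; the abstract does not state how many levels were used or how they were worded in the TTS prompts.

```python
from itertools import product

# Hypothetical two-level settings per cue (assumed for illustration;
# the actual levels are not given in the abstract).
LEVELS = {
    "speech_rate": ["slow", "fast"],
    "pitch_variation": ["flat", "dynamic"],
    "loudness": ["soft", "loud"],
}


def build_orthogonal_conditions(levels):
    """Return every combination of cue levels (a full factorial design)."""
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


conditions = build_orthogonal_conditions(LEVELS)

for condition in conditions:
    print(condition)
```

In such a design every level of one cue is paired equally often with every level of the others, so the cues are uncorrelated across stimuli and the effect of each on the ratings can be estimated separately.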