Prosody plays an important role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.
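As an illustration of what an orthogonal stimulus set means here, the following minimal sketch crosses every level of the three manipulated cues in a full factorial design, so that no cue co-varies with another. The abstract does not report the factor levels or the prompt wording, so the two levels per cue, the `FACTORS` table, and the `to_prompt` template are all assumptions made for this example, not the study's actual materials.

```python
from itertools import product

# Hypothetical factor levels: the abstract names the three cues but not
# the levels used, so two levels per cue are assumed for illustration.
FACTORS = {
    "speech_rate": ["slow", "fast"],
    "pitch_variation": ["flat", "varied"],
    "loudness": ["quiet", "loud"],
}


def build_conditions(factors):
    """Cross every level of every factor (full factorial design)."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


def to_prompt(condition):
    """Render one condition as a style prompt (wording is hypothetical)."""
    return (
        "Speak at a {speech_rate} rate, with {pitch_variation} pitch, "
        "in a {loudness} voice."
    ).format(**condition)


conditions = build_conditions(FACTORS)
prompts = [to_prompt(c) for c in conditions]

for prompt in prompts:
    print(prompt)
```

Because each level of one cue is paired equally often with each level of the others, a difference in ratings between, say, quiet and loud stimuli cannot be attributed to speech rate or pitch variation.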