Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., ``Dark--Bright''). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models using self-supervised representations outperform methods with classical acoustic features, particularly in capturing complex and dynamic impressions (e.g., ``Cold--Warm'') where classical features fail. In contrast, current MLLMs prove unreliable for this fine-grained pairwise task. This study provides the first systematic investigation of RIE and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations.
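The pairwise formulation above can be sketched as a regression over embedding differences. This is a minimal illustration only, assuming pooled self-supervised embeddings as input; the embedding size, number of antonymic axes, toy data, and linear head are all illustrative assumptions, not the paper's actual model.

```python
# Sketch of relative voice impression estimation (RIE) as pairwise
# regression. All dimensions and data below are illustrative
# assumptions, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 768   # assumed size of a pooled self-supervised speech embedding
N_AXES = 4      # assumed number of antonymic axes (e.g., Dark--Bright)

def predict_shift(emb_a, emb_b, W, b):
    """Predict the perceptual shift of utterance B relative to A.

    A linear head on the embedding difference yields one signed score
    per antonymic axis; a positive value means B is perceived as
    shifted toward the second pole of that axis.
    """
    return (emb_b - emb_a) @ W + b

# Toy utterance-level embeddings standing in for pooled SSL features
# of two utterances by the same speaker.
emb_a = rng.normal(size=EMB_DIM)
emb_b = rng.normal(size=EMB_DIM)

# Randomly initialized head parameters (in practice these would be
# trained against subjective pairwise impression ratings).
W = rng.normal(scale=0.01, size=(EMB_DIM, N_AXES))
b = np.zeros(N_AXES)

shift = predict_shift(emb_a, emb_b, W, b)
print(shift.shape)  # one signed score per antonymic axis
```

With a zero bias, this head is antisymmetric by construction: swapping the two utterances flips the sign of the predicted shift, matching the "relative" framing of the task.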