Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited in scope, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's rank correlation coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressiveness score of S2S models from 2.0 to 23.4 on a 100-point scale. Demo and code are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
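As a side note, the reported alignment with human perception uses Spearman's rank correlation coefficient (SRCC). A minimal, dependency-free sketch of how such alignment can be computed is shown below; the score and rating values are made-up toy data, not from the paper.

```python
# Sketch: Spearman's rank correlation (SRCC) between an automatic
# expressiveness score and human ratings. All data here is hypothetical.

def ranks(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def srcc(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy example: model scores vs. human MOS-style ratings (invented numbers).
model = [0.2, 0.9, 0.5, 0.7, 0.1]
human = [1.5, 4.8, 3.0, 4.0, 1.2]
print(round(srcc(model, human), 2))  # perfectly monotonic -> 1.0
```

In practice one would use `scipy.stats.spearmanr`, which additionally returns a p-value; the hand-rolled version above only illustrates the rank-then-correlate idea.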