EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Speech emotion recognition (SER) systems are constrained by existing datasets that typically cover only 6-10 basic emotions, lack scale and diversity, and face ethical challenges when collecting sensitive emotional states. We introduce EMONET-VOICE, a comprehensive resource addressing these limitations through two components: (1) EmoNet-Voice Big, a 5,000-hour multilingual pre-training dataset spanning 40 fine-grained emotion categories across 11 voices and 4 languages, and (2) EmoNet-Voice Bench, a rigorously validated benchmark of 4,7k samples with unanimous expert consensus on emotion presence and intensity levels. Using state-of-the-art synthetic voice generation, our privacy-preserving approach enables ethical inclusion of sensitive emotions (e.g., pain, shame) while maintaining controlled experimental conditions. Each sample underwent validation by three psychology experts. We demonstrate that our Empathic Insight models trained on our synthetic data achieve strong real-world dataset generalization, as tested on EmoDB and RAVDESS. Furthermore, our comprehensive evaluation reveals that while high-arousal emotions (e.g., anger: 95% accuracy) are readily detected, the benchmark successfully exposes the difficulty of distinguishing perceptually similar emotions (e.g., sadness vs. distress: 63% discrimination), providing quantifiable metrics for advancing nuanced emotion AI. EMONET-VOICE establishes a new paradigm for large-scale, ethically-sourced, fine-grained SER research.

翻译：语音情感识别系统受限于现有数据集，这些数据集通常仅覆盖6-10种基本情感，缺乏规模与多样性，且在收集敏感情感状态时面临伦理挑战。我们提出了EMONET-VOICE，这是一个通过两个组成部分解决上述局限性的综合性资源：(1) EmoNet-Voice Big，一个包含5,000小时的多语言预训练数据集，涵盖11种音色和4种语言下的40个细粒度情感类别；(2) EmoNet-Voice Bench，一个包含4.7千个样本、经过严格验证的基准数据集，其中每个样本的情感存在与强度级别均获得了专家的一致共识。利用最先进的合成语音生成技术，我们的隐私保护方法能够在保持受控实验条件的同时，合乎伦理地纳入敏感情感（如痛苦、羞耻）。每个样本均经过三位心理学专家的验证。我们证明，基于我们合成数据训练的Empathic Insight模型在EmoDB和RAVDESS数据集上测试时，展现出强大的真实世界数据集泛化能力。此外，我们的全面评估表明，虽然高唤醒度情感（如愤怒：95%准确率）易于检测，但该基准也成功揭示了区分感知上相似情感的困难（如悲伤与痛苦：63%的区分度），从而为推进细致入微的情感人工智能提供了可量化的指标。EMONET-VOICE为大规模、符合伦理来源、细粒度的语音情感识别研究确立了新范式。

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

多模态对话情感识别：方法、趋势、挑战与前景综述

专知会员服务

20+阅读 · 2025年5月28日

《大型语言模型情感认知》最新进展

专知会员服务

43+阅读 · 2024年10月3日

大型语言模型遇上文本中心的多模态情感分析：综述

专知会员服务

25+阅读 · 2024年6月13日

揭秘ChatGPT情感对话能力

专知会员服务

59+阅读 · 2023年4月9日