Interpreting the human voice matters in a wide range of applications. This study addresses the prediction of age, gender, and emotion from vocal cues. Advances in voice analysis technology span many domains, from improving customer interactions to enhancing healthcare and retail experiences. Recognizing emotion can support mental health care, while age and gender detection are useful in many settings. We explore deep learning models for these predictions and compare single-output, multi-output, and sequential models. Sourcing suitable data was challenging, so we combined the CREMA-D and EMO-DB datasets. Prior work showed promise on each prediction task individually, but little research has considered all three variables simultaneously. This paper identifies flaws in the individual-model approach and advocates for our novel multi-output learning architecture, the Speech-based Emotion Gender and Age Analysis (SEGAA) model. The experiments suggest that multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between the variables and the speech input while achieving improved runtime.
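To make the idea of a multi-output model concrete, the following is a minimal, dependency-free sketch of a forward pass in which one shared hidden layer feeds three task-specific heads. It is not the SEGAA architecture itself, which the abstract does not specify. The class name `MultiOutputSketch`, the layer sizes, the number of classes per task, and the feature count are all illustrative assumptions, and the weights are random rather than trained.

```python
import math
import random

# Hypothetical label-space sizes; the abstract does not specify them.
HEADS = {"emotion": 6, "gender": 2, "age": 4}


def dense(x, weights, biases):
    """Affine layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiOutputSketch:
    """One shared hidden layer feeding three task-specific softmax heads."""

    def __init__(self, n_features, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.trunk = init_layer(rng, n_features, n_hidden)
        self.heads = {name: init_layer(rng, n_hidden, k) for name, k in HEADS.items()}

    def forward(self, features):
        # The shared representation is computed once per utterance ...
        hidden = [max(0.0, h) for h in dense(features, *self.trunk)]
        # ... and reused by every head, unlike three independent models.
        return {name: softmax(dense(hidden, *params)) for name, params in self.heads.items()}


model = MultiOutputSketch(n_features=13)  # 13 is a placeholder feature count
probs = model.forward([0.5] * 13)
```

Because the shared layer is evaluated only once per input, a design along these lines does less work than running three separate single-output models, which is one plausible reading of the runtime improvement reported above.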