The intersection of technology and mental health has spurred innovative approaches to assessing emotional well-being, particularly through computational techniques applied to audio data. This study explores Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models trained on wavelet-extracted features and Mel-frequency cepstral coefficients (MFCCs) for emotion detection from speech. Data augmentation, feature extraction, normalization, and model training were conducted to evaluate the models' performance in classifying emotional states. The CNN model achieved 61% accuracy, outperforming the LSTM model's 56%. Both models performed best on emotions with distinctive acoustic signatures, such as surprise and anger, which exhibit pronounced pitch and speaking-rate variation. Recommendations include exploring more advanced data augmentation techniques, combining feature extraction methods, and integrating linguistic analysis with acoustic characteristics to improve accuracy in mental health diagnostics. Collaboration on standardized dataset collection and sharing is also recommended to advance affective computing and mental health care interventions.
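The preprocessing steps named above (data augmentation and feature normalization) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are illustrative, the waveform is synthetic, and a random matrix stands in for a real MFCC array.

```python
import numpy as np

def add_noise(signal, noise_level=0.005, rng=None):
    """Augmentation: inject Gaussian noise into the waveform."""
    rng = rng or np.random.default_rng(0)
    return signal + noise_level * rng.standard_normal(signal.shape)

def time_shift(signal, shift):
    """Augmentation: circularly shift the waveform to vary timing."""
    return np.roll(signal, shift)

def zscore_normalize(features):
    """Normalize each coefficient track to zero mean, unit variance."""
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True) + 1e-8
    return (features - mean) / std

# Toy 1-second "clip" at 16 kHz standing in for recorded speech.
wave = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
augmented = time_shift(add_noise(wave), shift=800)

# Placeholder (n_mfcc, n_frames) matrix standing in for extracted MFCCs.
mfcc_like = np.random.default_rng(1).standard_normal((13, 100)) * 5 + 2
normalized = zscore_normalize(mfcc_like)
```

In a real pipeline the normalized feature matrices would then be fed to the CNN or LSTM classifier; per-coefficient z-scoring keeps features on a comparable scale, which typically stabilizes training.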