Despite the abundance of current research on sentiment analysis from video and audio, finding the model that yields the highest accuracy remains a challenge for researchers in this field. The main objective of this paper is to demonstrate the usability of emotion recognition models that take video and audio inputs. The models are trained on the CREMA-D dataset for audio and the RAVDESS dataset for video. The fine-tuned models used are Facebook/wav2vec2-large for audio and Google/vivit-b-16x2-kinetics400 for video. The decision-making framework averages the per-emotion probabilities generated by the two models. Because the results diverged, with one model achieving much higher accuracy than the other, an additional test framework was created. The methods used are the Weighted Average method, the Confidence Level Threshold method, the Dynamic Weighting Based on Confidence method, and the Rule-Based Logic method. This limited approach gives encouraging results that make future research into these methods viable.
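As a rough illustration of the fusion strategies named above, the sketch below implements simple averaging plus the four test-framework methods over two per-emotion probability vectors. All function names, weights, and thresholds here are illustrative assumptions, not the paper's actual implementation.

```python
def average_fusion(audio_probs, video_probs):
    """Element-wise mean of the two models' per-emotion probabilities."""
    return [(a + v) / 2 for a, v in zip(audio_probs, video_probs)]

def weighted_fusion(audio_probs, video_probs, w_audio=0.6):
    """Weighted Average: a fixed weight favors the stronger modality
    (w_audio=0.6 is an arbitrary example value)."""
    return [w_audio * a + (1 - w_audio) * v
            for a, v in zip(audio_probs, video_probs)]

def confidence_threshold_fusion(audio_probs, video_probs, threshold=0.7):
    """Confidence Level Threshold: if one model's top probability clears
    the threshold, trust that model alone; otherwise fall back to the
    simple average."""
    if max(audio_probs) >= threshold:
        return audio_probs
    if max(video_probs) >= threshold:
        return video_probs
    return average_fusion(audio_probs, video_probs)

def dynamic_weighted_fusion(audio_probs, video_probs):
    """Dynamic Weighting Based on Confidence: weight each model by its
    own top-class probability."""
    ca, cv = max(audio_probs), max(video_probs)
    total = ca + cv
    return [(ca * a + cv * v) / total
            for a, v in zip(audio_probs, video_probs)]

def rule_based_fusion(audio_probs, video_probs):
    """Rule-Based Logic (one possible rule set): if both models agree on
    the top emotion, average them; otherwise defer to the more
    confident model."""
    if audio_probs.index(max(audio_probs)) == video_probs.index(max(video_probs)):
        return average_fusion(audio_probs, video_probs)
    return audio_probs if max(audio_probs) >= max(video_probs) else video_probs
```

Each function maps two probability vectors (one per modality, same emotion ordering) to a single fused vector from which the predicted emotion is taken as the argmax.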