Despite the abundance of current research on sentiment analysis from video and audio, finding the model that yields the highest accuracy remains a challenge for researchers in this field. The main objective of this paper is to demonstrate the usability of emotion recognition models that take video and audio inputs. The models are trained on the CREMA-D dataset for audio and the RAVDESS dataset for video. The fine-tuned models used are Facebook/wav2vec2-large for audio and Google/vivit-b-16x2-kinetics400 for video. The decision-making framework averages the per-emotion probabilities generated by the two models. Because the results diverged, with one model achieving much higher accuracy than the other, an additional test framework was created. The methods used are the Weighted Average method, the Confidence Level Threshold method, the Dynamic Weighting Based on Confidence method, and the Rule-Based Logic method. This limited approach gives encouraging results that make future research into these methods viable.
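As a rough illustration of the fusion strategies named above, the sketch below implements simple averaging plus the four test-framework methods over two per-emotion probability vectors. All function names, weights, and thresholds here are illustrative assumptions, not the paper's actual implementation.

```python
def average_fusion(audio_probs, video_probs):
    """Element-wise mean of the two models' per-emotion probabilities."""
    return [(a + v) / 2 for a, v in zip(audio_probs, video_probs)]

def weighted_fusion(audio_probs, video_probs, w_audio=0.6):
    """Weighted Average: a fixed weight favors the stronger modality
    (w_audio=0.6 is an arbitrary example value)."""
    return [w_audio * a + (1 - w_audio) * v
            for a, v in zip(audio_probs, video_probs)]

def confidence_threshold_fusion(audio_probs, video_probs, threshold=0.7):
    """Confidence Level Threshold: if one model's top probability clears
    the threshold, trust that model alone; otherwise fall back to the
    simple average."""
    if max(audio_probs) >= threshold:
        return audio_probs
    if max(video_probs) >= threshold:
        return video_probs
    return average_fusion(audio_probs, video_probs)

def dynamic_weighted_fusion(audio_probs, video_probs):
    """Dynamic Weighting Based on Confidence: weight each model by its
    own top-class probability."""
    ca, cv = max(audio_probs), max(video_probs)
    total = ca + cv
    return [(ca * a + cv * v) / total
            for a, v in zip(audio_probs, video_probs)]

def rule_based_fusion(audio_probs, video_probs):
    """Rule-Based Logic (one possible rule set): if both models agree on
    the top emotion, average them; otherwise defer to the more
    confident model."""
    if audio_probs.index(max(audio_probs)) == video_probs.index(max(video_probs)):
        return average_fusion(audio_probs, video_probs)
    return audio_probs if max(audio_probs) >= max(video_probs) else video_probs
```

Each function maps two probability vectors (one per modality, same emotion ordering) to a single fused vector from which the predicted emotion is taken as the argmax.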