This paper proposes a system that recognizes a speaker's utterance-level emotions from multimodal cues in a video. The system integrates multiple AI models to extract and pre-process multimodal information from the raw video input. An end-to-end multimodal emotion recognition (MER) model then predicts the speaker's emotion for each utterance in sequence. Users can also try the system interactively through the implemented interface.
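The two-stage flow described in the abstract (pre-processing, then sequential utterance-level prediction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Utterance`, `preprocess`, `predict_sequentially`, and `toy_model`, the label set, and the input dictionary layout are all hypothetical. The sketch also assumes that "sequentially" means each prediction may see the labels of earlier utterances, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

EMOTIONS = ("neutral", "happy", "sad", "angry")  # hypothetical label set


@dataclass
class Utterance:
    """Pre-processed multimodal features for one utterance (hypothetical layout)."""
    text: str
    audio: Sequence[float]
    frames: Sequence[float]


def preprocess(raw_segments: List[dict]) -> List[Utterance]:
    """Stage 1 stand-in: turn raw per-utterance segments into model inputs."""
    return [
        Utterance(seg["text"].strip(), list(seg["audio"]), list(seg["frames"]))
        for seg in raw_segments
    ]


def predict_sequentially(
    utterances: List[Utterance],
    model: Callable[[Utterance, List[str]], str],
) -> List[str]:
    """Stage 2: label utterances in order, passing earlier labels as context."""
    labels: List[str] = []
    for utt in utterances:
        # Pass a copy so the model cannot mutate the running history.
        labels.append(model(utt, list(labels)))
    return labels


def toy_model(utt: Utterance, history: List[str]) -> str:
    """Placeholder for the MER model: a keyword rule, not a trained network."""
    if "great" in utt.text.lower():
        return "happy"
    # Fall back to the previous label to illustrate the use of context.
    return history[-1] if history else "neutral"


segments = [
    {"text": " Hello. ", "audio": [0.0], "frames": [0.1]},
    {"text": "That is great!", "audio": [0.2], "frames": [0.3]},
    {"text": "See you.", "audio": [0.1], "frames": [0.2]},
]
labels = predict_sequentially(preprocess(segments), toy_model)
print(labels)
```

In a real system, `toy_model` would be replaced by the trained end-to-end MER model, and `preprocess` by the models that extract text, audio, and visual features from the raw video.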