Movie story analysis requires understanding characters' emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse, multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset, we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the 10 and 25 most frequently occurring labels, and on a mapping that clusters 181 labels into 26. Ablation studies and comparisons against adapted state-of-the-art emotion recognition approaches show the effectiveness of EmoTx. Analyzing EmoTx's self-attention scores reveals that predictions of expressive emotions often attend to character tokens, while those of other mental states rely on video and dialog cues.
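To make the multi-label formulation concrete, the following is a minimal sketch of how a set of emotions could be read off per-label scores for a scene and for each character. It is not the EmoTx implementation: the function name `predict_labels`, the 0.5 threshold, the character names, and all logit values are hypothetical and chosen only for illustration. Only the four label names come from the examples above.

```python
import math


def predict_labels(logits, threshold=0.5):
    """Return every label whose independent sigmoid probability meets the threshold.

    Each label is scored on its own, so any number of labels (including none)
    can be predicted at once. This is what distinguishes a multi-label setup
    from a single-label softmax classifier.
    """
    predicted = []
    for label, logit in logits.items():
        prob = 1.0 / (1.0 + math.exp(-logit))
        if prob >= threshold:
            predicted.append(label)
    return predicted


# Hypothetical logits, standing in for the outputs of some model.
scene_logits = {"happy": 2.0, "angry": -3.0, "honest": 0.5, "helpful": -0.2}
character_logits = {
    "character_A": {"happy": 1.5, "angry": -2.0, "honest": 1.0, "helpful": 0.3},
    "character_B": {"happy": -1.0, "angry": 0.8, "honest": -0.5, "helpful": -1.2},
}

# One prediction set for the scene, and one per character.
scene_pred = predict_labels(scene_logits)
character_pred = {
    name: predict_labels(logits) for name, logits in character_logits.items()
}

print("scene:", scene_pred)
for name, labels in character_pred.items():
    print(name + ":", labels)
```

In this sketch the scene and each character receive their own label set, mirroring the scene-level and character-level predictions described above. A joint model would produce all of these logits in a single forward pass rather than from separate lookups.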