FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

Emotional Video Captioning (EVC) is an emerging task that aims to describe the factual content of a video together with the intrinsic emotions it expresses. Existing works perceive global emotional cues and then combine them with the video content to generate descriptions. However, because they mine factual and emotional cues insufficiently and coordinate them poorly during generation, these methods struggle with the factual-emotional bias, i.e., the fact that different samples place different demands on factual and emotional content during generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which collaboratively mines factual-emotional semantics within a unified architecture and provides adaptive, accurate guidance for generation, overcoming the tendency to compromise between factual and emotional descriptions when learning over all samples. Technically, we first introduce an external repository and retrieve the sentences most relevant to the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines the different components using the video content to effectively mine the factual semantics. Meanwhile, our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and an emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment each factual semantic with emotions. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module that predicts and adjusts the degree of bias of each sample.
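To make the retrieval step concrete, the following is a minimal sketch of selecting the sentences most relevant to a video from an external repository. It is illustrative only and not the FACE-net implementation: it assumes that the video and the repository sentences have already been embedded into a shared vector space, and it uses cosine similarity as the relevance score. The abstract does not specify the encoder or the similarity measure, so those choices, the toy embeddings, and the names `cosine_similarity` and `retrieve_top_k` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    video_embedding: Sequence[float],
    repository: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k repository sentences most similar to the video embedding.

    Assumption: `repository` holds (sentence, embedding) pairs that were
    produced by some text encoder aligned with the video encoder.
    """
    scored = [
        (sentence, cosine_similarity(video_embedding, embedding))
        for sentence, embedding in repository
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-written embeddings purely for illustration.
repository = [
    ("a man is playing guitar on stage", [0.9, 0.1, 0.0]),
    ("a dog runs across a field", [0.0, 1.0, 0.2]),
    ("a crowd cheers at a concert", [0.7, 0.0, 0.6]),
]
video_embedding = [1.0, 0.0, 0.2]

top_sentences = retrieve_top_k(video_embedding, repository, k=2)
for sentence, score in top_sentences:
    print(f"{score:.3f}  {sentence}")
```

In a pipeline like the one described above, the retrieved sentences would then be passed on to later stages (for example, to be split into subject-predicate-object triplets); that part is not sketched here.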
