This paper identifies two kinds of redundancy in the current VideoQA paradigm. First, current video encoders tend to embed all video clues holistically, at different granularities and in a hierarchical manner, which inevitably introduces \textit{neighboring-frame redundancy} that can overwhelm detailed visual clues at the object level. Second, prevailing vision-language fusion designs introduce \textit{cross-modal redundancy} by exhaustively fusing all visual elements with question tokens without explicitly differentiating their pairwise vision-language interactions, which harms answer prediction. To this end, we propose a novel transformer-based architecture that models VideoQA in a redundancy-aware manner. To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes object-level changes in neighboring frames, while adopting an out-of-neighboring message-passing scheme that imposes attention only on distant frames. As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling mechanism, which explicitly differentiates the vision-language interactions by identifying a small subset of visual elements that exclusively support the answer. Building on these advancements, we find that this \underline{R}edundancy-\underline{a}ware trans\underline{former} (RaFormer) achieves state-of-the-art results on multiple VideoQA benchmarks.
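To make the out-of-neighboring idea concrete, the following is a minimal sketch of an attention mask that blocks nearby frames and leaves only distant ones visible. It is an illustration under stated assumptions, not the RaFormer implementation: the function names, the `radius` parameter, and the choice to keep each frame's own position (so that no row of the mask is empty) are all hypothetical and do not come from the abstract.

```python
import math


def out_of_neighbor_mask(num_frames, radius):
    """Return mask[i][j], True when frame i may attend to frame j.

    Assumption: frames within `radius` steps count as neighbors and are
    blocked. The frame's own position is kept (also an assumption) so that
    every row has at least one allowed entry.
    """
    return [
        [abs(i - j) > radius or i == j for j in range(num_frames)]
        for i in range(num_frames)
    ]


def masked_softmax(scores, mask_row):
    """Softmax over allowed positions only; blocked positions get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Six frames, neighbors defined as frames at most one step away.
mask = out_of_neighbor_mask(num_frames=6, radius=1)

# Uniform scores for frame 2: weight is spread over frames 0, 2, 4 and 5,
# while the immediate neighbors (frames 1 and 3) receive none.
weights = masked_softmax([0.0] * 6, mask[2])
```

In a real model the scores would be query-key dot products and the mask would be applied inside the attention layer; the sketch only shows which frame pairs are allowed to exchange messages.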