Generating effective latent representations and then refining them to incorporate precise information is an essential prerequisite for Vision-Language Understanding (VLU) tasks such as Video Question Answering (VQA). However, most existing VLU methods focus on sparsely sampling the input or processing it at a finer granularity (e.g., sampling a sparse set of frames or text tokens), or on adding external knowledge. We present DRAX (Distraction Removal and Attended Cross-Alignment), a novel method that rids cross-modal representations of distractors in the latent space. Rather than restricting which input information from each modality is perceived, we use an attention-guided distraction removal method to increase focus on task-relevant information in the latent embeddings. DRAX also ensures semantic alignment of embeddings during cross-modal fusion. We evaluate our approach on a challenging benchmark, the SUTD-TrafficQA dataset, through extensive experiments that test the framework's abilities in feature and event queries, temporal relation understanding, forecasting, hypothesis, and causal analysis.