Video question answering (VideoQA) aims to answer a question about video content by reasoning about the alignment semantics between the video and the question. However, because they rely heavily on human instructions, i.e., annotations or priors, current contrastive learning-based VideoQA methods still struggle to perform fine-grained visual-linguistic alignment. In this work, we turn to game theory, which can model complicated relationships among multiple players under specific interaction strategies; here, the video, question, and answer act as three players, and their interactions are used to achieve fine-grained alignment for the VideoQA task. Specifically, we design a VideoQA-specific interaction strategy tailored to the characteristics of VideoQA, which mathematically generates fine-grained visual-linguistic alignment labels without labor-intensive annotation. Our TG-VQA outperforms the existing state of the art by a large margin (more than 5%) on both long-term and short-term VideoQA datasets, verifying its effectiveness and generalization ability. Thanks to the guidance of game-theoretic interaction, our model converges well on limited data ($10^4$ videos), surpassing most models pre-trained on large-scale data ($10^7$ videos).