Despite recent progress in Video Question Answering (VideoQA), existing methods typically function as black boxes, making it difficult to understand their reasoning processes and to perform consistent compositional reasoning. To address these challenges, we propose a \textit{model-agnostic} Video Alignment and Answer Aggregation (VA$^{3}$) framework that enhances both the compositional consistency and the accuracy of existing VideoQA methods by integrating a video aligner and an answer aggregator module. The video aligner hierarchically selects video clips relevant to the question, while the answer aggregator deduces the answer to a question from the answers to its sub-questions, with compositional consistency ensured by the information flow along the question decomposition graph and a contrastive learning strategy. We evaluate our framework on three settings of the AGQA-Decomp dataset with three baseline methods, and propose new metrics to measure the compositional consistency of VideoQA methods more comprehensively. Moreover, we propose a large language model (LLM) based automatic question decomposition pipeline that allows our framework to be applied to any VideoQA dataset; using this pipeline, we extend the MSVD and NExT-QA datasets to evaluate our VA$^{3}$ framework in broader scenarios. Extensive experiments show that our framework improves both the compositional consistency and the accuracy of existing methods, leading to more interpretable real-world VideoQA models.