This paper addresses video question answering (videoQA) with a decomposed, multi-stage, modular reasoning framework. Previous modular methods have shown promise using a single planning stage that is not grounded in visual content. However, using a simple and effective baseline, we find that such systems can be brittle in practice in challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage, used in conjunction with an external memory. All stages are training-free and are performed using few-shot prompting of large models, producing interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA), achieving state-of-the-art results, and extends to related tasks (grounded videoQA, paragraph captioning).
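To make the staged design concrete, the following is a minimal sketch of how an event parser, a grounding stage, and a reasoning stage might communicate through a shared external memory. It is not the MoReVQA implementation: every name here (`Memory`, `parse_events`, `ground_events`, `reason`, `answer_question`) is hypothetical, frames are represented by toy caption strings rather than visual content, and each stage is a keyword-matching stand-in for what the abstract describes as few-shot prompting of large models.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stopword list for the toy event parser (an assumption).
STOPWORDS = {"what", "does", "the", "a", "do", "is", "after", "before", "when", "who"}


@dataclass
class Memory:
    """Shared external memory that every stage reads from and writes to."""
    events: List[str] = field(default_factory=list)
    grounded_frames: List[int] = field(default_factory=list)
    notes: Dict[str, str] = field(default_factory=dict)


def parse_events(question: str, frame_captions: List[str], memory: Memory) -> None:
    """Stage 1 (stand-in): extract event keywords from the question text only."""
    words = question.lower().strip("?").split()
    memory.events = [w for w in words if w not in STOPWORDS]


def ground_events(question: str, frame_captions: List[str], memory: Memory) -> None:
    """Stage 2 (stand-in): keep indices of frames whose caption mentions an event."""
    memory.grounded_frames = [
        i for i, caption in enumerate(frame_captions)
        if any(event in caption.lower().split() for event in memory.events)
    ]


def reason(question: str, frame_captions: List[str], memory: Memory) -> None:
    """Stage 3 (stand-in): answer using only the grounded evidence in memory."""
    evidence = [frame_captions[i] for i in memory.grounded_frames]
    memory.notes["answer"] = evidence[-1] if evidence else "unknown"


def answer_question(question: str, frame_captions: List[str]) -> Memory:
    """Run the three stages in order; the returned memory holds every
    intermediate output, so each step can be inspected."""
    memory = Memory()
    for stage in (parse_events, ground_events, reason):
        stage(question, frame_captions, memory)
    return memory


if __name__ == "__main__":
    captions = ["a man throws a ball", "the dog runs", "the dog catches the ball"]
    result = answer_question("What does the dog catch?", captions)
    print(result.events, result.grounded_frames, result.notes["answer"])
```

Because every stage writes its output to the same `Memory` object, the parsed events and grounded frame indices remain available for inspection after the final answer is produced, which mirrors the interpretable intermediate outputs the abstract emphasizes.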