We present Q-ViD, a simple approach to video question answering (video QA). Unlike prior methods, which rely on complex architectures, computationally expensive pipelines, or closed models such as GPTs, Q-ViD uses a single instruction-aware open vision-language model (InstructBLIP) to tackle video QA through frame descriptions. Specifically, we create captioning instruction prompts conditioned on the target questions about the videos and use InstructBLIP to obtain frame captions that are useful for the task at hand. We then build a description of the whole video from these question-dependent frame captions and feed it, along with a question-answering prompt, to a large language model (LLM). The LLM serves as our reasoning module and performs the final step of multiple-choice QA. Our simple Q-ViD framework achieves competitive or even higher performance than current state-of-the-art models on a diverse range of video QA benchmarks, including NExT-QA, STAR, How2QA, TVQA, and IntentQA.
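The pipeline described above (question-dependent captioning of frames, assembly of a video description, then multiple-choice QA with an LLM) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the prompt wording, the function names, and the `captioner` and `llm` callables are all assumptions introduced here. In practice `captioner` would wrap an InstructBLIP call and `llm` would wrap a language model call.

```python
import re
from typing import Callable, Sequence

LETTERS = "ABCDEFGH"

# Hypothetical signatures: captioner(frame, prompt) -> caption, llm(prompt) -> reply.
Captioner = Callable[[object, str], str]
LLM = Callable[[str], str]


def build_caption_prompt(question: str) -> str:
    # Assumed wording; the point is only that the captioning
    # instruction depends on the target question.
    return (
        "Describe this video frame in detail, focusing on what is "
        f"relevant to the question: {question}"
    )


def describe_video(frames: Sequence[object], question: str, captioner: Captioner) -> str:
    # Caption every sampled frame with the same question-dependent prompt,
    # then join the captions into one description of the whole video.
    prompt = build_caption_prompt(question)
    captions = [captioner(frame, prompt) for frame in frames]
    return "\n".join(f"Frame {i + 1}: {c}" for i, c in enumerate(captions))


def build_qa_prompt(description: str, question: str, options: Sequence[str]) -> str:
    # Assumed multiple-choice prompt layout.
    option_lines = "\n".join(f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options))
    return (
        f"Video description:\n{description}\n\n"
        f"Question: {question}\n"
        f"Options:\n{option_lines}\n"
        "Answer with the letter of the correct option."
    )


def answer_question(
    frames: Sequence[object],
    question: str,
    options: Sequence[str],
    captioner: Captioner,
    llm: LLM,
) -> int:
    # Returns the index of the chosen option, or -1 if no letter is found.
    description = describe_video(frames, question, captioner)
    reply = llm(build_qa_prompt(description, question, options))
    valid = LETTERS[: len(options)]
    # Naive parsing: take the first standalone option letter in the reply.
    match = re.search(rf"\b([{valid}])\b", reply)
    return valid.index(match.group(1)) if match else -1
```

Passing the two models in as plain callables keeps the sketch independent of any particular library, so the same flow can be exercised with stub functions before real models are plugged in.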