Video Question Answering (VideoQA) models enhance understanding of and interaction with audiovisual content, making it more accessible, searchable, and useful in fields such as education, surveillance, entertainment, and content creation. Because of heavy compute requirements, most large visual language models (VLMs) for VideoQA operate on a fixed number of frames obtained by uniformly sampling the video. Uniform sampling, however, neither targets the important frames nor captures the context of the video. We present a novel query-based method that selects frames relevant to the question using submodular mutual information (SMI) functions. By replacing uniform frame sampling with query-based selection, our method ensures that the chosen frames provide complementary and essential visual information for accurate VideoQA. We evaluate our approach on the MVBench dataset, which spans a diverse set of multi-action video tasks. VideoQA accuracy on this dataset was assessed using two VLMs, Video-LLaVA and LLaVA-NeXT, both of which originally employ uniform frame sampling. Experiments were conducted with both the uniform and the query-based sampling strategies. Query-based frame selection improved accuracy by up to \textbf{4\%} over uniform sampling. Qualitative analysis further shows that query-based selection with SMI functions consistently picks frames that are better aligned with the question. We believe that such query-based frame selection can enhance accuracy in a wide range of tasks that rely on only a subset of video frames.
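To make the idea of query-based selection concrete, the following is a minimal sketch of greedy frame selection under one facility-location-style SMI objective, $I(A;Q)=\sum_{i\in V}\min\big(\max_{a\in A}s_{ia},\ \eta\,s_{iq}\big)$. The abstract does not state which SMI function, embedding model, or optimizer is used, so the objective, the cosine similarity, the parameter `eta`, and the function names `cosine_sim` and `select_frames` are all illustrative assumptions rather than the method evaluated here. Frame and query embeddings are assumed to come from some shared image-text encoder.

```python
import numpy as np


def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def select_frames(frame_emb, query_emb, k, eta=1.0):
    """Greedily pick k frame indices under an assumed facility-location-style
    SMI objective: sum_i min(max_{a in A} s_ia, eta * s_iq).

    frame_emb: (n, d) array of frame embeddings (hypothetical encoder output).
    query_emb: (d,) array holding the question embedding.
    """
    n = frame_emb.shape[0]
    k = min(k, n)
    # Clip negative similarities to zero so the objective stays monotone.
    s_ff = np.clip(cosine_sim(frame_emb, frame_emb), 0.0, None)
    s_fq = np.clip(cosine_sim(frame_emb, query_emb[None, :]), 0.0, None)[:, 0]
    cap = eta * s_fq           # query relevance caps each frame's contribution
    cover = np.zeros(n)        # max similarity of each frame to the selected set
    chosen = np.zeros(n, dtype=bool)
    for _ in range(k):
        current = np.minimum(cover, cap).sum()
        # Column j holds the coverage vector that would result from adding frame j.
        new_cover = np.maximum(cover[:, None], s_ff)
        gains = np.minimum(new_cover, cap[:, None]).sum(axis=0) - current
        gains[chosen] = -np.inf            # never pick the same frame twice
        j = int(np.argmax(gains))
        chosen[j] = True
        cover = np.maximum(cover, s_ff[:, j])
    # Return indices in temporal order, as a VLM would expect its input frames.
    return np.flatnonzero(chosen).tolist()
```

With such a routine, the fixed frame budget of the VLM stays unchanged; only the indices passed to the model differ from those produced by uniform sampling. Because each frame's contribution is capped by its relevance to the question, adding a frame that duplicates already-selected content yields little gain, which is one way to favor complementary frames.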