Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a large language model (LLM)-based planner decomposes the query into tool calls and specifies how the per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, where it outperforms other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge .
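To make the decompose-and-merge idea concrete, the following is a minimal sketch of how per-tool frame scores might be combined under a boolean plan. The abstract does not specify how the operators act on rankings, so everything here is an assumption for illustration: the `Leaf`/`Op` plan representation, the tool names (`object`, `action`, `text`), and the fuzzy-logic reading of the operators (AND as per-frame minimum, OR as maximum, NOT as `1 - score` over scores normalized to [0, 1]). It should not be read as the ToolMerge implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Leaf:
    """Hypothetical plan leaf: refers to one tool's per-frame scores."""
    tool: str


@dataclass
class Op:
    """Hypothetical plan node: combines child plans with AND, OR, or NOT."""
    name: str
    args: List["Node"]


Node = Union[Leaf, Op]


def merge(node: Node, scores: Dict[str, List[float]]) -> List[float]:
    """Evaluate a boolean plan over per-frame scores in [0, 1].

    Assumed semantics (not taken from the paper):
    AND = per-frame min, OR = per-frame max, NOT = 1 - score.
    """
    if isinstance(node, Leaf):
        return list(scores[node.tool])
    children = [merge(arg, scores) for arg in node.args]
    if node.name == "AND":
        return [min(vals) for vals in zip(*children)]
    if node.name == "OR":
        return [max(vals) for vals in zip(*children)]
    if node.name == "NOT":
        if len(children) != 1:
            raise ValueError("NOT takes exactly one argument")
        return [1.0 - v for v in children[0]]
    raise ValueError(f"unknown operator: {node.name}")


def top_k(merged: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames (ties: earlier frame first)."""
    order = sorted(range(len(merged)), key=lambda i: (-merged[i], i))
    return order[:k]


# Toy scores for four frames from three hypothetical tools.
scores = {
    "object": [0.9, 0.2, 0.8, 0.1],
    "action": [0.3, 0.7, 0.9, 0.2],
    "text": [0.0, 0.0, 0.1, 0.9],
}

# Plan: (object AND action) OR text
plan = Op("OR", [Op("AND", [Leaf("object"), Leaf("action")]), Leaf("text")])

merged = merge(plan, scores)
print(merged)            # [0.3, 0.2, 0.8, 0.9]
print(top_k(merged, 2))  # [3, 2]
```

In this toy plan, frame 2 ranks highly because both the object and action tools agree on it, while frame 3 is selected through the OR branch on the strength of the text tool alone. A fixed single-tool schema could not express that kind of query-dependent combination.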