Temporal-logic video question answering requires a model to reason about when actions occur relative to one another, through relations such as before, after, until, since, and overlap as well as multi-event chains, rather than merely what is present in a video. Standard vision-language models typically answer such questions in a single pass over a fixed, uniformly sampled set of frames, which is poorly matched to evidence that is often localized to narrow action boundaries or dispersed across several distant events. We present an evidence-seeking agent that treats temporal-logic VideoQA as active exploration. The agent follows a Think-Act-Observe loop driven by a multi-granular sampling toolkit, where every observation is interleaved with its absolute timestamp so that temporal relations reduce to numerical comparisons on a shared time axis. Its behavior is shaped by benchmark structure: a lightweight classifier routes each question to a temporal category, each with a tailored policy, iteration depth, and prompt, while sampling budgets adapt to corpus characteristics and clip length. The resulting training-free system couples Gemini 3.1 Pro with a temporal-reasoning policy and achieves 77.13 AvgAcc on the official TimeLogic test set.
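To make concrete the idea that temporal relations reduce to numerical comparisons on a shared time axis, the following is a minimal sketch rather than the system's actual implementation. It assumes each observed event has already been localized to a start and end time in seconds; the `Event` class, the helper functions, and the example events and timestamps are all hypothetical. The `until` and `since` relations are omitted because their exact semantics are not specified here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Event:
    """A localized action with absolute timestamps (hypothetical structure)."""
    label: str
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video


def before(a: Event, b: Event) -> bool:
    # a finishes no later than b begins
    return a.end <= b.start


def after(a: Event, b: Event) -> bool:
    # a begins no earlier than b finishes
    return before(b, a)


def overlap(a: Event, b: Event) -> bool:
    # the two intervals share a span of nonzero length
    return a.start < b.end and b.start < a.end


def is_chain(events: Sequence[Event]) -> bool:
    # a multi-event chain holds when each event precedes the next
    return all(before(x, y) for x, y in zip(events, events[1:]))


# Illustrative events with made-up timestamps
pour = Event("pour water", 3.0, 7.5)
stir = Event("stir", 7.5, 12.0)
drink = Event("drink", 10.0, 15.0)

print(before(pour, stir))
print(overlap(stir, drink))
print(is_chain([pour, stir, drink]))
```

Under these assumptions, once observations carry absolute timestamps the final relational check needs no further visual reasoning, only interval arithmetic.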