Existing text-to-video retrieval benchmarks are dominated by real-world footage in which much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines and observe consistent, interpretable failure patterns: both frequently confuse the main video with its temporal hard negative and disproportionately prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding in decisive end-state evidence, while remaining comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on huggingface.co.
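For concreteness, the triplet-based relative-order diagnostic mentioned above can be computed as in the following minimal sketch. It assumes a JSONL file in which each entry stores the query, the main video, the two hard negatives, and a transition category; the field names and the `score` text-video similarity function are illustrative placeholders, not the released benchmark format or a specific retriever.

```python
# Minimal sketch of the triplet-based relative-order diagnostic (assumed data layout).
# For each transition category, report how often the main video outscores each hard negative.
import json
from collections import defaultdict
from typing import Callable, Dict

def relative_order_stats(path: str, score: Callable[[str, str], float]) -> Dict[str, Dict[str, float]]:
    """Per-category fraction of queries where the main video outscores the
    temporal / semantic hard negative under a given text-video scorer."""
    wins = defaultdict(lambda: {"vs_temporal": 0, "vs_semantic": 0, "n": 0})
    with open(path) as f:
        for line in f:
            # Hypothetical fields: query, main, temporal_neg, semantic_neg, category.
            t = json.loads(line)
            s_main = score(t["query"], t["main"])
            wins[t["category"]]["vs_temporal"] += int(s_main > score(t["query"], t["temporal_neg"]))
            wins[t["category"]]["vs_semantic"] += int(s_main > score(t["query"], t["semantic_neg"]))
            wins[t["category"]]["n"] += 1
    return {
        c: {"vs_temporal": w["vs_temporal"] / w["n"], "vs_semantic": w["vs_semantic"] / w["n"]}
        for c, w in wins.items()
    }
```

A low `vs_temporal` rate alongside a high `vs_semantic` rate is the signature failure pattern reported above: the retriever is comparatively robust to content substitution but fails to ground the decisive end-state.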