CoVR-R: Reason-Aware Composed Video Retrieval

Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan

From arXiv; CVPR 2026 (Findings)

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking the after-effects and implicit consequences of the edit (e.g., motion, state transitions, viewpoint or duration cues). We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer the causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on Recall@K and particularly excels on implicit-effect subsets. Automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances the interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
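The abstract reports results in terms of recall at K over ranked candidate videos. As a minimal sketch of how such a score is typically computed, the Python below ranks candidates by cosine similarity to a query embedding and then measures recall at K. The function names, the toy embeddings, and the use of cosine similarity are assumptions made for illustration; the abstract does not specify how the reasoned queries are scored against candidates.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query_vec: Sequence[float],
                    candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return candidate video ids sorted from most to least similar to the query.

    Assumption: query_vec stands in for an embedding of the reasoned query
    (reference video + edit + inferred after-effects); how that embedding is
    produced is outside the scope of this sketch.
    """
    return sorted(candidates,
                  key=lambda vid: cosine(query_vec, candidates[vid]),
                  reverse=True)


def recall_at_k(rankings: Sequence[Sequence[str]],
                targets: Sequence[str],
                k: int) -> float:
    """Fraction of queries whose target id appears among the top-k ranked ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for ranking, target in zip(rankings, targets)
               if target in ranking[:k])
    return hits / len(targets)


if __name__ == "__main__":
    # Toy 2-D embeddings; the ids and vectors are hypothetical.
    gallery = {"vid_a": [1.0, 0.0], "vid_b": [0.0, 1.0], "vid_c": [1.0, 1.0]}
    queries = [[1.0, 0.1], [0.1, 1.0]]
    targets = ["vid_c", "vid_b"]
    rankings = [rank_candidates(q, gallery) for q in queries]
    for k in (1, 2):
        print(f"Recall@{k} = {recall_at_k(rankings, targets, k):.2f}")
```

In this toy setup the first query ranks its target second and the second query ranks its target first, so recall improves from K=1 to K=2. A benchmark with hard distractors, as described for CoVR-Reason, would stress exactly this ranking step.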
