CoVR-R: Reason-Aware Composed Video Retrieval

Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan

From arXiv, 9 pages, 3 figures

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
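
The abstract reports results in recall at K. As a rough illustration only, the sketch below shows how that metric is commonly computed for retrieval, assuming each query has exactly one target video and that candidates are ranked by cosine similarity between a query embedding and candidate embeddings. The function names, the single-target assumption, and the toy vectors are hypothetical and are not taken from the CoVR-R code or benchmark.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query: Sequence[float],
                    candidates: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(rankings: Sequence[Sequence[int]],
                targets: Sequence[int],
                k: int) -> float:
    """Fraction of queries whose target index appears in the top-k ranked candidates.

    Assumes one ground-truth target per query (an assumption of this sketch).
    """
    if not rankings:
        return 0.0
    hits = sum(1 for ranking, target in zip(rankings, targets)
               if target in ranking[:k])
    return hits / len(rankings)


# Toy example: two queries over three hypothetical candidate embeddings.
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
targets = [2, 0]  # index of the correct video for each query

rankings = [rank_candidates(q, candidates) for q in queries]
r_at_1 = recall_at_k(rankings, targets, k=1)
r_at_2 = recall_at_k(rankings, targets, k=2)
```

In this toy setup the first query's target is ranked second and the second query's target is ranked first, so the metric rises as K grows. In a reasoning-first pipeline like the one the abstract describes, the query embedding would presumably come from the reasoned query rather than the raw edit text, but that step is outside the scope of this sketch.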
