In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has focused mainly on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and imposes five types of controllable single-factor constraints: temporal order, color, visual style, audio, and resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, generating them with large models and verifying them manually. Building on this benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a substantial gap to human performance, with a clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results indicate that open-domain video shot retrieval remains a critical capability that current multimodal large models have yet to master.
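To make the three-stage design concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under assumptions, not the authors' implementation: all function names and bodies (expand_query, retrieve_candidates, localize_shot) are hypothetical placeholders, since the abstract does not specify concrete models, search APIs, or localization methods.

```python
# Hypothetical sketch of the three-stage ShotFinder pipeline described
# in the abstract. Every function body below is a stub; a real system
# would plug in an LLM, a video search engine, and a multimodal model.

from dataclasses import dataclass


@dataclass
class ShotQuery:
    description: str  # keyframe-oriented shot description
    constraint: str   # one of: temporal order, color, visual style, audio, resolution


def expand_query(query: ShotQuery) -> str:
    """Stage 1: query expansion via 'video imagination'.

    Placeholder: a real system would prompt an LLM to elaborate the
    terse shot description into a richer search query (imagined scene
    content, likely captions, tags).
    """
    return f"{query.description} ({query.constraint})"


def retrieve_candidates(expanded_query: str, top_k: int = 10) -> list[str]:
    """Stage 2: candidate video retrieval with a search engine.

    Placeholder: a real system would issue the expanded query to a
    video search engine (e.g. YouTube search) and return candidate
    video IDs or URLs.
    """
    return [f"video_{i}" for i in range(top_k)]


def localize_shot(video_id: str, query: ShotQuery) -> tuple[float, float] | None:
    """Stage 3: description-guided temporal localization.

    Placeholder: a real system would run a multimodal model over the
    candidate video to find the (start, end) span matching the shot
    description, returning None if no span satisfies the constraint.
    """
    return (0.0, 5.0)  # dummy span in seconds


def shotfinder(query: ShotQuery) -> list[tuple[str, tuple[float, float]]]:
    """End-to-end: expand the query, retrieve candidates, then localize
    the requested shot within each candidate video."""
    expanded = expand_query(query)
    results = []
    for vid in retrieve_candidates(expanded):
        span = localize_shot(vid, query)
        if span is not None:
            results.append((vid, span))
    return results


if __name__ == "__main__":
    q = ShotQuery("a red sports car drifting at sunset", constraint="color")
    print(shotfinder(q))
```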