In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has focused mainly on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and imposes five types of controllable single-factor constraints: temporal order, color, visual style, audio, and resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, generating them with large models and verifying them manually. Building on this benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a substantial gap to human performance, with a clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results indicate that open-domain video shot retrieval remains a critical capability that current multimodal large models have yet to master.
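To make the three-stage design concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under assumptions, not the authors' implementation: all function names and bodies (expand_query, retrieve_candidates, localize_shot) are hypothetical placeholders, since the abstract does not specify concrete models, search APIs, or localization methods.

```python
# Hypothetical sketch of the three-stage ShotFinder pipeline described
# in the abstract. Every function body below is a stub; a real system
# would plug in an LLM, a video search engine, and a multimodal model.

from dataclasses import dataclass


@dataclass
class ShotQuery:
    description: str  # keyframe-oriented shot description
    constraint: str   # one of: temporal order, color, visual style, audio, resolution


def expand_query(query: ShotQuery) -> str:
    """Stage 1: query expansion via 'video imagination'.

    Placeholder: a real system would prompt an LLM to elaborate the
    terse shot description into a richer search query (imagined scene
    content, likely captions, tags).
    """
    return f"{query.description} ({query.constraint})"


def retrieve_candidates(expanded_query: str, top_k: int = 10) -> list[str]:
    """Stage 2: candidate video retrieval with a search engine.

    Placeholder: a real system would issue the expanded query to a
    video search engine (e.g. YouTube search) and return candidate
    video IDs or URLs.
    """
    return [f"video_{i}" for i in range(top_k)]


def localize_shot(video_id: str, query: ShotQuery) -> tuple[float, float] | None:
    """Stage 3: description-guided temporal localization.

    Placeholder: a real system would run a multimodal model over the
    candidate video to find the (start, end) span matching the shot
    description, returning None if no span satisfies the constraint.
    """
    return (0.0, 5.0)  # dummy span in seconds


def shotfinder(query: ShotQuery) -> list[tuple[str, tuple[float, float]]]:
    """End-to-end: expand the query, retrieve candidates, then localize
    the requested shot within each candidate video."""
    expanded = expand_query(query)
    results = []
    for vid in retrieve_candidates(expanded):
        span = localize_shot(vid, query)
        if span is not None:
            results.append((vid, span))
    return results


if __name__ == "__main__":
    q = ShotQuery("a red sports car drifting at sunset", constraint="color")
    print(shotfinder(q))
```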