Temporal grounding, the task of returning the interval $[t_s, t_e]$ that answers a natural-language query over a video, is the language interface to long-form video, yet it has been studied largely on short videos; how grounding behaves at hour scale remains underexplored. We take the position that at hour scale the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing an event within a nearby region, but by finding the region of a long video that is relevant to the query. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos; mean duration 75.7 min, maximum 9 hr) with an open-form query distribution. Three findings support the position: every open Video-LLM tested collapses, while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM, mirroring retrieve-then-read in open-domain QA.
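To make the search-versus-recognition split concrete, the following is a minimal sketch of a retrieve-then-ground pipeline. It is not the implementation evaluated on ExtremeWhenBench. All names (`retrieve_window`, `retrieve_then_ground`, `ground_fn`) are hypothetical, the per-frame query scores are assumed to come from some frame-level retriever, and the fixed-length sliding window is an assumption chosen for simplicity.

```python
from typing import Callable, Sequence, Tuple

Interval = Tuple[float, float]  # (t_s, t_e) in seconds


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def retrieve_window(scores: Sequence[float], fps: float, window_s: float) -> Interval:
    """Search stage (assumed design): return the fixed-length window whose
    summed frame-query similarity is highest; ties keep the earliest window."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must be non-empty")
    w = max(1, min(n, round(window_s * fps)))  # window length in frames
    current = sum(scores[:w])
    best, best_start = current, 0
    for i in range(1, n - w + 1):
        # Slide the window one frame to the right.
        current += scores[i + w - 1] - scores[i - 1]
        if current > best:
            best, best_start = current, i
    return best_start / fps, (best_start + w) / fps


def retrieve_then_ground(
    scores: Sequence[float],
    fps: float,
    window_s: float,
    ground_fn: Callable[[float, float], Interval],
) -> Interval:
    """Retrieve a candidate window, then let a local grounder refine it.

    `ground_fn(start, end)` is a hypothetical stand-in for a Video-LLM run
    only on the retrieved clip; it returns an interval relative to `start`.
    """
    start, end = retrieve_window(scores, fps, window_s)
    local_s, local_e = ground_fn(start, end)
    return start + local_s, start + local_e
```

In this sketch the grounder only ever sees a short clip, so a wrong window from the search stage cannot be repaired downstream. That is the sense in which search, rather than recognition, would be the binding constraint.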