This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, which comprises meticulously re-annotated versions of three popular benchmarks produced under strict quality criteria. Our analysis reveals dramatic model re-rankings compared with the legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on this data foundation, we conduct in-depth explorations of algorithmic design principles, arriving at a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, thinking-free reinforcement learning with verifiable rewards (RLVR) as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.