VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Large Vision-Language Models (LVLMs) have made significant progress in video understanding, yet they still struggle with tasks that require precise spatiotemporal localization at the instance level. Existing methods rely primarily on text prompts for human-model interaction, but text prompts cannot easily convey precise spatial and temporal references, resulting in a poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning on language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage, fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We then internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and reinforcement learning (RL) training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also transferring effectively to general video understanding benchmarks. The relevant datasets and code will be released publicly.
