Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, which makes task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than ground their predictions in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and then reconstructs the original token sequence from those slots; objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
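To make the decompose-then-reconstruct idea concrete, the following is a minimal NumPy sketch of a slot-attention style bottleneck. It is an illustration under stated assumptions, not the SlotVTG implementation: the function name `slot_adapter`, the random projection matrices, the number of slots and iterations, and the weighted-mean slot update (used here in place of the learned recurrent update found in standard slot attention) are all hypothetical choices. The objectness prior from the self-supervised vision model is omitted.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_adapter(tokens, num_slots=4, iters=3, seed=0):
    """Toy bottleneck: tokens (N, D) -> slots (K, D) -> reconstruction (N, D).

    All weights are random stand-ins (assumption); a real adapter would
    learn them during fine-tuning.
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    scale = d ** -0.5

    # Hypothetical query/key/value projections.
    w_q = rng.normal(scale=scale, size=(d, d))
    w_k = rng.normal(scale=scale, size=(d, d))
    w_v = rng.normal(scale=scale, size=(d, d))

    slots = rng.normal(size=(num_slots, d))
    keys, values = tokens @ w_k, tokens @ w_v

    for _ in range(iters):
        queries = slots @ w_q
        logits = (keys @ queries.T) * scale            # (N, K)
        # Softmax over the slot axis: slots compete to explain each token.
        attn = softmax(logits, axis=1)
        # Normalize per slot over tokens so each slot is a weighted mean.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ values                     # (K, D)

    # Reuse the last token-to-slot assignment to broadcast slots back to
    # the original token positions.
    recon = attn @ slots                               # (N, D)
    return slots, attn, recon


tokens = np.random.default_rng(1).normal(size=(16, 8))  # 16 visual tokens, dim 8
slots, attn, recon = slot_adapter(tokens, num_slots=4)
print(slots.shape, attn.shape, recon.shape)
```

The point of the sketch is the shape of the computation: the softmax is taken over slots rather than over tokens, so each visual token must be distributed among a small number of slots, and the reconstructed sequence has the same length as the input and can be passed on in its place.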