This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To develop fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. Trained on this data, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs across diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.