Despite progress in video-language modeling, the computational challenge of interpreting long-form videos in response to task-specific linguistic queries persists, largely due to the complexity of high-dimensional video data and the misalignment between language and visual cues over space and time. To tackle this issue, we introduce a novel approach called Language-guided Spatial-Temporal Prompt Learning (LSTP). This approach features two key components: a Temporal Prompt Sampler (TPS) with an optical flow prior that leverages temporal information to efficiently extract relevant video content, and a Spatial Prompt Solver (SPS) that captures the intricate spatial relationships between visual and textual elements. By harmonizing TPS and SPS with a cohesive training strategy, our framework significantly enhances computational efficiency, temporal understanding, and spatial-temporal alignment. Empirical evaluations across two challenging tasks, video question answering and temporal question grounding in videos, using a variety of video-language pretraining models (VLPs) and large language models (LLMs), demonstrate the superior performance, speed, and versatility of our proposed LSTP paradigm.
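To make the two-stage design concrete, the following is a minimal PyTorch sketch of an LSTP-style pipeline: a sampler that scores frames by combining an optical-flow motion prior with query relevance and keeps the top-k, followed by a solver that cross-attends text tokens to the visual tokens of the sampled frames. All module names, dimensions, and the exact scoring rule here are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an LSTP-style pipeline; shapes and scoring are assumptions.
import torch
import torch.nn as nn

class TemporalPromptSampler(nn.Module):
    """Scores frames with an optical-flow prior plus query relevance, keeps top-k."""
    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)  # query-conditioned relevance head (assumed form)

    def forward(self, frame_feats, flow_mag, query):
        # frame_feats: (T, dim) per-frame features; flow_mag: (T,) mean flow magnitude
        # query: (dim,) pooled text embedding
        relevance = self.score(frame_feats * query).squeeze(-1)  # (T,) query relevance
        prior = flow_mag / (flow_mag.sum() + 1e-6)               # normalized motion prior
        scores = relevance + torch.log(prior + 1e-6)             # combine in log space
        idx = scores.topk(self.k).indices.sort().values          # keep temporal order
        return frame_feats[idx], idx

class SpatialPromptSolver(nn.Module):
    """Cross-attends text tokens to visual tokens of the sampled frames."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (1, L, dim); visual_tokens: (1, K*P, dim) patch tokens
        fused, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        return fused  # spatially grounded text features

if __name__ == "__main__":
    dim, T, L, P = 256, 32, 12, 16
    tps, sps = TemporalPromptSampler(dim), SpatialPromptSolver(dim)
    frames = torch.randn(T, dim)           # per-frame pooled visual features
    flow = torch.rand(T)                   # precomputed optical-flow magnitudes
    query = torch.randn(dim)               # pooled question embedding
    kept, idx = tps(frames, flow, query)   # temporal prompt sampling
    patches = torch.randn(1, kept.size(0) * P, dim)  # patch tokens of kept frames
    text = torch.randn(1, L, dim)
    out = sps(text, patches)               # features to condition a VLP/LLM head
    print(out.shape, idx.tolist())
```

In this reading, the flow prior cheaply narrows the temporal search space before any heavy cross-modal computation, which is where the claimed efficiency gain would come from; the cross-attention stage then handles spatial grounding only on the surviving frames.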