While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs still suffer from several commonly observed limitations, e.g., coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and a detached video-language view. In this work, we aim to enhance VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First, we represent the input texts and videos with fine-grained scene graph (SG) structures, which are further unified into a holistic SG (HSG) for bridging the two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the perception of changes in objects across the spatial and temporal dimensions. Next, based on the fine-grained structural features of the TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment, respectively, enhancing video-language grounding in both spatiality and temporality. We design our method as a plug-and-play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets, in both standard and long-form video scenarios, Finsta consistently improves 13 existing strong-performing VLMs and significantly refreshes the state-of-the-art end-task performance in both the fine-tuning and zero-shot settings.
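To make the alignment objective above concrete, the following is a minimal sketch (not the authors' released implementation) of how the object-centered spatial alignment between matched TSG and DSG object-node features could be realized as a symmetric InfoNCE-style contrastive loss; the names `tsg_obj_feats`, `dsg_obj_feats`, and `temperature` are hypothetical, and the assumption that row i of each tensor is a matched text-video object pair is ours.

```python
import torch
import torch.nn.functional as F

def object_centered_alignment_loss(
    tsg_obj_feats: torch.Tensor,   # (N, d) object-node embeddings from the textual SG
    dsg_obj_feats: torch.Tensor,   # (N, d) matched object-node embeddings from the video DSG
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss over matched TSG/DSG object nodes.

    Assumes row i of each tensor is a matched text-video object pair;
    all other rows in the batch serve as in-batch negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    t = F.normalize(tsg_obj_feats, dim=-1)
    v = F.normalize(dsg_obj_feats, dim=-1)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = t @ v.t() / temperature
    targets = torch.arange(t.size(0), device=t.device)

    # Average the text-to-video and video-to-text directions.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2v + loss_v2t)


if __name__ == "__main__":
    # Toy usage: 8 matched object pairs with 256-dim node features.
    tsg = torch.randn(8, 256)
    dsg = torch.randn(8, 256)
    print(object_centered_alignment_loss(tsg, dsg))
```

The predicate-centered temporal alignment could plausibly take the same contrastive form, applied to predicate-node features aggregated over the DSG's temporal dimension rather than to per-object features.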