Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed a strong tendency of video-text models toward frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns in semantic information from extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning at significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communications between tokens in self-attention, and node sparsity, which discards uninformative visual tokens. Trained with a curriculum that increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, at a fraction of the computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.
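To make the two forms of sparsity concrete, the following is a minimal NumPy sketch and not the SViTT implementation. The abstract does not specify the sparsity patterns, so the choices here are assumptions for illustration only: edge sparsity is modeled as a local attention window, and node sparsity as keeping the tokens that receive the most attention. All function names (`local_edge_mask`, `sparse_attention`, `prune_tokens`) and parameters (`window`, `keep_ratio`) are hypothetical.

```python
import numpy as np

def local_edge_mask(num_tokens: int, window: int) -> np.ndarray:
    """Edge sparsity (assumed pattern): token i may attend to j only if |i - j| <= window."""
    idx = np.arange(num_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def sparse_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the allowed query-key pairs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)        # remove disallowed edges
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def prune_tokens(tokens, saliency, keep_ratio):
    """Node sparsity (assumed criterion): keep the most salient tokens, in original order."""
    num_keep = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.sort(np.argsort(saliency)[-num_keep:])
    return tokens[keep], keep

# Toy example: 8 visual tokens with 4-dimensional features.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))

mask = local_edge_mask(num_tokens=8, window=1)
out, weights = sparse_attention(x, x, x, mask)

# Assumed saliency: total attention each token receives from all queries.
saliency = weights.sum(axis=0)
kept, kept_idx = prune_tokens(out, saliency, keep_ratio=0.5)
```

In this toy setup, the mask reduces the number of query-key pairs from 64 to 22, and pruning halves the number of tokens passed on to the next layer. A curriculum like the one described above would correspond to shrinking `window` or `keep_ratio` as the number of frames grows, though the actual schedule is not given in the abstract.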