In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies, and they typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome this limitation, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets which contain only short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.
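To make the hierarchical data-token idea concrete, below is a minimal sketch, not the authors' implementation, of multi-scale temporal tokens produced by pooling frame features over windows of increasing length. The module name `MultiScaleTemporalTokens`, the specific window sizes, and the per-scale linear projections are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed design, not VidLA's actual code): hierarchical
# "data tokens" that summarize frame-level features at multiple temporal
# resolutions, so fine tokens capture short-range dependencies and coarse
# tokens capture long-range context.
import torch
import torch.nn as nn


class MultiScaleTemporalTokens(nn.Module):
    """Pools frame features over non-overlapping windows of increasing
    length and projects each scale; all hyperparameters are hypothetical."""

    def __init__(self, dim: int, window_sizes=(1, 4, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        # One learnable projection per temporal scale (assumed choice).
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in window_sizes])

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: [batch, num_frames, dim]
        b, t, d = frame_feats.shape
        tokens = []
        for w, proj in zip(self.window_sizes, self.proj):
            w = min(w, t)  # guard against clips shorter than the window
            # Average-pool over non-overlapping windows of length w.
            pooled = frame_feats[:, : (t // w) * w].reshape(b, t // w, w, d).mean(dim=2)
            tokens.append(proj(pooled))
        # Concatenate all scales into one token sequence:
        # [batch, sum over scales of (num_frames // w), dim]
        return torch.cat(tokens, dim=1)


# Usage: 16 frames of 512-d features -> 16 + 4 + 1 = 21 multi-scale tokens.
feats = torch.randn(2, 16, 512)
print(MultiScaleTemporalTokens(512)(feats).shape)  # torch.Size([2, 21, 512])
```

In this sketch, window-1 tokens retain per-frame detail for short-range dependencies, while the widest window summarizes long-range context; the resulting token sequence could then be consumed by the video tower alongside its spatial tokens, which keeps the architecture simple enough to initialize from a pretrained image-text two-tower model.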