In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting pre-trained large image-language models to video-language alignment remains a major challenge. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical content, thus improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0X speed-up). Overall, our ViLA network outperforms state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0X speed-up, and our 2-frame model outperforms the 4-frame SeViLA on the VLEP dataset with a 4.2X speed-up.