Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting a pre-trained large image-language model to video-language alignment remains a major challenge. In this work, we propose an efficient Video-Language Alignment (ViLA) network that addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation module (QFormer-Distiller). Compared with prior work, our ViLA model selects key frames with critical content, improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0X speed-up). Overall, our ViLA network outperforms state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0X speed-up, and our 2-frame model outperforms 4-frame SeViLA on the VLEP dataset with a 4.2X speed-up. The code will be available at https://github.com/xijun-cs/ViLA.
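To make the idea of text-guided frame sampling concrete, the following is a minimal sketch, not the paper's actual Frame-Prompter: it scores each frame embedding by cosine similarity to a question embedding and keeps the top-k frames in temporal order. The function name, the use of cosine similarity, and the toy one-hot embeddings are all illustrative assumptions.

```python
import numpy as np

def select_key_frames(frame_embs: np.ndarray, text_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Score frames by cosine similarity to the text query and keep the top-k.

    frame_embs: (T, D) per-frame embeddings; text_emb: (D,) question embedding.
    Returns the indices of the k best-matching frames, sorted in temporal order.
    (Illustrative only -- not the learnable Frame-Prompter from the paper.)
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = f @ t                    # (T,) text-frame similarity per frame
    top = np.argsort(scores)[-k:]     # indices of the k highest-scoring frames
    return np.sort(top)               # restore temporal order

# Toy check with one-hot frame embeddings: the query is built from
# frames 2 and 5, so those two frames score highest.
frames = np.eye(8, 16)                # 8 frames, 16-dim one-hot embeddings
query = frames[2] + frames[5]
idx = select_key_frames(frames, query, k=2)   # → array([2, 5])
```

A trained sampler would replace the fixed cosine score with a learned, differentiable selection (so gradients can flow to the Frame-Prompter), but the top-k structure of the selection step is the same.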