VLAP: Efficient Video-Language Alignment via Frame Prompting and Distilling for Video Question Answering

In this work, we propose VLAP, an efficient Video-Language Alignment network based on Frame-Prompting and Distilling. VLAP addresses efficient frame sampling and effective cross-modal alignment in a unified way. The network combines a new learnable, question-aware Frame-Prompter with a new cross-modal distillation module (QFormer-Distiller). Pre-trained large image-language models have shown promising results on problems such as visual question answering. However, how to sample image frames efficiently and effectively when adapting a pre-trained large image-language model to video-language alignment remains a major challenge. Compared with prior work, VLAP selects key frames with critical content, improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0X speed-up). Overall, VLAP outperforms state-of-the-art methods on video question-answering benchmarks: for example, +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0X speed-up, and our 2-frame model outperforms the 4-frame SeViLA on VLEP with a 4.2X speed-up.
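
To make the idea of question-aware key-frame selection concrete, here is a minimal sketch. It is not the authors' Frame-Prompter, which is learnable; as an assumption for illustration, it uses a fixed cosine-similarity score between each frame feature and a question feature and keeps the top k frames. The function names `cosine` and `select_key_frames` and the toy feature vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_key_frames(frame_feats, question_feat, k):
    """Return the indices of the k frames most similar to the question, in temporal order.

    Assumption: frames and the question are already embedded in a shared space.
    A learnable prompter would replace this fixed scoring rule.
    """
    scores = [cosine(f, question_feat) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore temporal order for the downstream model


# Toy example: four 2-D frame features and one question feature.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
question = [1.0, 0.2]
selected = select_key_frames(frames, question, k=2)
print(selected)
```

Passing fewer frames to the image-language model is where a latency reduction would come from; the accuracy effect depends on how well the scoring picks frames relevant to the question.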
