In this work, we propose an efficient Video-Language Alignment via Frame-Prompting and Distilling (VLAP) network. Our VLAP model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our VLAP network, we design a new learnable question-aware Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering. However, how to sample image frames efficiently and effectively when adapting a pre-trained large image-language model to video-language alignment remains a major challenge. Compared with prior work, our VLAP model demonstrates the capability of selecting key frames with critical content, thus improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0× speed-up). Overall, our VLAP network outperforms the state-of-the-art methods on video question-answering benchmarks (e.g., +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0× speed-up; our 2-frame model outperforms the 4-frame SeViLA on VLEP with a 4.2× speed-up).