Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently with one or more image encoders and then model the interaction between the frames and the question. However, this scheme incurs significant memory use and inevitably slows down both training and inference. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, in which we concatenate video frames into an $n\times n$ matrix and then convert it to a single image. By doing so, we reduce the number of image encoder passes from $n^{2}$ to $1$ while maintaining the temporal structure of the original video. Experimental results on MSRVTT and TrafficQA show that our proposed approach achieves state-of-the-art performance with nearly $4\times$ faster speed and only 30% of the memory use. We show that by integrating our approach into VideoQA systems, we can achieve comparable, or even superior, performance with a significant speed-up in training and inference. We believe the proposed approach can facilitate VideoQA-related research by reducing the computational requirements for researchers with limited budgets and resources. Our code will be made publicly available for research use.
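The frame-concatenation step can be illustrated with a minimal sketch. It assumes the $n^{2}$ sampled frames share one shape and are placed in the grid in row-major temporal order (left to right, then top to bottom); the abstract does not specify the ordering, and the function name `frames_to_grid` is hypothetical rather than taken from the released code.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile n*n frames of shape (H, W, C) into one (n*H, n*W, C) image.

    Assumption: frames are laid out in row-major temporal order, so
    frame t lands at grid row t // n and grid column t % n.
    """
    if frames.ndim != 4 or frames.shape[0] != n * n:
        raise ValueError(f"expected {n * n} frames of shape (H, W, C)")
    _, h, w, c = frames.shape
    grid = frames.reshape(n, n, h, w, c)   # (grid_row, grid_col, H, W, C)
    grid = grid.transpose(0, 2, 1, 3, 4)   # (grid_row, H, grid_col, W, C)
    return grid.reshape(n * h, n * w, c)   # merge rows with H, cols with W


if __name__ == "__main__":
    n = 3
    # Nine dummy 2x3 single-channel frames; frame t is filled with the value t.
    frames = np.stack([np.full((2, 3, 1), t, dtype=np.uint8) for t in range(n * n)])
    image = frames_to_grid(frames, n)
    print(image.shape)
    print(image[:, :, 0])
```

The resulting array can then be resized to the input resolution of the image encoder and encoded in a single pass instead of $n^{2}$ separate passes; the resizing and the encoder itself are left out here because they depend on the chosen pre-trained model.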