We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and video question answering. A generative encoder-decoder model is first jointly pre-trained on massive image-text data to learn fundamental vision-language concepts, and is then adapted to video data in an intermediate video-text pre-training stage to learn video-specific skills such as spatio-temporal reasoning. The resulting model, VideoOFA, achieves new state-of-the-art performance on four video captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr score. It also outperforms existing models on two open-ended video question answering datasets, showcasing its generalization capability as a universal video-to-text model.