Existing video captioning approaches typically first sample frames from a decoded video and then run a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may miss key information in the video and thus degrade performance. Additionally, redundant information in the sampled frames may make video captioning inference inefficient. To address this, we study video captioning from a different perspective, in the compressed domain, which brings two advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient at inference because it processes a smaller amount of less redundant information. We propose a simple yet effective end-to-end transformer that learns to caption directly from the compressed video. We show that even with a simple design, our method achieves state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at https://github.com/acherstyx/CoCap.