Existing video captioning approaches typically first sample frames from a decoded video and then run a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may miss key information in the video and thus degrade performance. Additionally, redundant information in the sampled frames may make video captioning inference inefficient. To address this, we study video captioning from a different perspective, in the compressed domain, which brings several advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient at inference because less, and less redundant, information is processed. We propose a simple yet effective end-to-end transformer in the compressed domain that learns to caption directly from the compressed video. We show that even with a simple design, our method achieves state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at https://github.com/acherstyx/CoCap.