Video Captioning (VC) is a challenging multi-modal task, since it requires understanding diverse and complex videos and describing their scenes in language. For machines, traditional VC follows the "imaging-compression-decoding-and-then-captioning" pipeline, in which compression is pivotal for storage and transmission. Such a pipeline has two inherent shortcomings: information redundancy, which lowers efficiency, and information loss during the sampling process for captioning. To address these problems, we propose a novel VC pipeline that generates captions directly from the compressed measurement captured by a snapshot compressive sensing camera. We dub our model SnapCap. Specifically, signal simulation lets us obtain abundant measurement-video-annotation data pairs for training our model. In addition, to better extract language-related visual representations from the compressed measurement, we distill knowledge from videos via a pre-trained CLIP, which carries plentiful language-vision associations, to guide the learning of SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on two widely used VC datasets. Both the qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines. In particular, compared with "caption-after-reconstruction" methods, SnapCap runs at least 3$\times$ faster and achieves better captioning results.
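As a rough illustration of the signal simulation mentioned in the abstract, the sketch below assumes the forward model commonly used in video snapshot compressive imaging: each of $T$ frames is modulated by its own mask, and the modulated frames are summed into a single 2D measurement, $Y = \sum_{t=1}^{T} M_t \odot X_t$. The abstract does not specify the masks, the number of frames, or the resolution, so the function name `simulate_snapshot_measurement`, the random binary masks, and the sizes used here are illustrative assumptions rather than the SnapCap configuration.

```python
import numpy as np


def simulate_snapshot_measurement(video, masks):
    """Collapse T frames into one 2D measurement: Y = sum_t masks[t] * video[t].

    Assumed forward model (standard in video snapshot compressive imaging);
    not taken from the SnapCap implementation.
    """
    if video.shape != masks.shape:
        raise ValueError("video and masks must share the shape (T, H, W)")
    # Element-wise modulation of each frame, then summation over time.
    return (video * masks).sum(axis=0)


rng = np.random.default_rng(0)
T, H, W = 8, 64, 64  # hypothetical frame count and resolution

# Stand-in for T grayscale video frames with values in [0, 1).
video = rng.random((T, H, W))

# Hypothetical random binary modulation masks, one per frame.
masks = rng.integers(0, 2, size=(T, H, W)).astype(video.dtype)

measurement = simulate_snapshot_measurement(video, masks)
print(measurement.shape)  # (64, 64)
```

Pairing each simulated `measurement` with its source `video` and the existing caption annotations would give one measurement-video-annotation triplet of the kind the abstract describes, without requiring physical captures.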