Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain abundant, detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset with VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), to further improve understanding. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate long-description capability more comprehensively. Extensive experimental results on widely used text-video retrieval benchmarks with both short and long descriptions, as well as on our LVDR benchmark, fully demonstrate the effectiveness of our method.
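The abstract only names the DDR and HDR objectives without specifying their form. As a rough, hypothetical illustration of what a pairwise description-ranking loss can look like in a CLIP-style dual-encoder setting (a minimal sketch under our own assumptions, not the paper's actual formulation), consider:

```python
import torch
import torch.nn.functional as F

def description_ranking_loss(video_emb, pos_text_emb, neg_text_emb, margin=0.1):
    """Illustrative pairwise margin ranking loss (not the paper's formulation).

    The more faithful description (pos: more detail / no hallucination) should
    score higher against the video than the degraded one (neg: details dropped
    or hallucinations injected). Embeddings are assumed L2-normalized, (B, D).
    """
    pos_sim = (video_emb * pos_text_emb).sum(dim=-1)  # cosine similarity
    neg_sim = (video_emb * neg_text_emb).sum(dim=-1)
    return F.relu(margin - (pos_sim - neg_sim)).mean()

# Usage with dummy embeddings from any CLIP-style dual encoder:
B, D = 4, 512
v = F.normalize(torch.randn(B, D), dim=-1)
t_pos = F.normalize(torch.randn(B, D), dim=-1)
t_neg = F.normalize(torch.randn(B, D), dim=-1)
loss = description_ranking_loss(v, t_pos, t_neg)
```

A margin-based pairwise loss is only one plausible choice here; the key idea the abstract conveys is that ranking degraded descriptions below faithful ones teaches the model to be sensitive to detail loss (DDR) and hallucination (HDR).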