Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain rich, detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather VILD, a large-scale pre-training dataset of VIdeo and Long-Description pairs. We then propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), to further improve understanding. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate long-description capability more comprehensively. Extensive experimental results on widely used text-video retrieval benchmarks with both short and long descriptions, as well as on our LVDR benchmark, fully demonstrate the effectiveness of our method.