Vision-language models pre-trained on large-scale image-text datasets have shown strong performance on downstream tasks such as image retrieval. Most pre-training images depict open-domain, common-sense visual elements. In contrast, video covers in short video search scenarios are user-generated content that provides an important visual summary of each video. In addition, a portion of video covers carry manually designed cover texts that provide complementary semantics. To fill the gap in short video cover data, we establish the first large-scale cover-text benchmark for Chinese short video search scenarios. Specifically, we release two large-scale datasets, CBVS-5M and CBVS-10M, which provide short video covers, and a manually fine-labeled dataset, CBVS-20K, which provides real user queries and serves as an image-text benchmark for Chinese short video search. To integrate the semantics of cover texts even when that modality is missing, we propose UniCLIP, in which cover texts play a guiding role during training but are not relied upon at inference. Extensive evaluation on CBVS-20K demonstrates the excellent performance of our proposal. UniCLIP has been deployed to Tencent's online video search systems, which receive hundreds of millions of visits, and has achieved significant gains. The dataset and code are available at https://github.com/QQBrowserVideoSearch/CBVS-UniCLIP.