While recent progress in video-text retrieval has been driven by the exploration of powerful model architectures and training strategies, the representation learning ability of video-text retrieval models is still limited by low-quality and scarce training data annotations. To address this issue, we present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features. Specifically, we first adopt a simple augmentation method that generates self-similar data by randomly duplicating or dropping subwords and frames. In addition, inspired by recent advances in visual and language generative models, we propose a more powerful augmentation method that performs textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). Further, to bring richer information into video and text, we propose a hallucination-based augmentation method in which LLMs and VGMs generate and add new relevant information to the original data. Benefiting from the enriched data, HaVTR outperforms existing methods on several video-text retrieval benchmarks, as demonstrated by extensive experiments.
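To make the "simple augmentation" step concrete, the following is a minimal sketch of random duplicate/drop augmentation applied to subword tokens and sampled frames. The function names and the probability parameters (`dup_prob`, `drop_prob`) are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import random
from typing import Any, List, Sequence


def augment_tokens(tokens: List[str], dup_prob: float = 0.1, drop_prob: float = 0.1) -> List[str]:
    """Create a self-similar caption by randomly dropping or duplicating subword tokens.

    Probabilities are placeholder values; the abstract does not specify them.
    """
    out: List[str] = []
    for tok in tokens:
        r = random.random()
        if r < drop_prob:
            continue                 # drop this subword
        out.append(tok)
        if r > 1.0 - dup_prob:
            out.append(tok)          # duplicate this subword
    return out if out else tokens    # guard against an empty caption


def augment_frames(frames: Sequence[Any], dup_prob: float = 0.1, drop_prob: float = 0.1) -> List[Any]:
    """Apply the same duplicate/drop scheme to a sequence of sampled video frames."""
    out: List[Any] = []
    for frame in frames:
        r = random.random()
        if r < drop_prob:
            continue                 # drop this frame
        out.append(frame)
        if r > 1.0 - dup_prob:
            out.append(frame)        # duplicate this frame
    return out if out else list(frames)
```

The LLM/VGM-based paraphrasing, stylization, and hallucination augmentations described above operate on the same inputs (captions and frames) but replace this random perturbation with generated content; their prompting and generation details are beyond the scope of this abstract.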