While recent progress in video-text retrieval has been driven by the exploration of powerful model architectures and training strategies, the representation learning ability of video-text retrieval models is still limited by low-quality and scarce training data annotations. To address this issue, we present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features. Specifically, we first adopt a simple augmentation method that generates self-similar data by randomly duplicating or dropping subwords and frames. In addition, inspired by recent advances in visual and language generative models, we propose a more powerful augmentation method that performs textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). Further, to bring richer information into video and text, we propose a hallucination-based augmentation method in which LLMs and VGMs generate and add new relevant information to the original data. Benefiting from the enriched data, HaVTR outperforms existing methods on several video-text retrieval benchmarks, as demonstrated by extensive experiments.
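To make the "simple augmentation" step concrete, the following is a minimal sketch of random duplicate/drop augmentation applied to subword tokens and sampled frames. The function names and the probability parameters (`dup_prob`, `drop_prob`) are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import random
from typing import Any, List, Sequence


def augment_tokens(tokens: List[str], dup_prob: float = 0.1, drop_prob: float = 0.1) -> List[str]:
    """Create a self-similar caption by randomly dropping or duplicating subword tokens.

    Probabilities are placeholder values; the abstract does not specify them.
    """
    out: List[str] = []
    for tok in tokens:
        r = random.random()
        if r < drop_prob:
            continue                 # drop this subword
        out.append(tok)
        if r > 1.0 - dup_prob:
            out.append(tok)          # duplicate this subword
    return out if out else tokens    # guard against an empty caption


def augment_frames(frames: Sequence[Any], dup_prob: float = 0.1, drop_prob: float = 0.1) -> List[Any]:
    """Apply the same duplicate/drop scheme to a sequence of sampled video frames."""
    out: List[Any] = []
    for frame in frames:
        r = random.random()
        if r < drop_prob:
            continue                 # drop this frame
        out.append(frame)
        if r > 1.0 - dup_prob:
            out.append(frame)        # duplicate this frame
    return out if out else list(frames)
```

The LLM/VGM-based paraphrasing, stylization, and hallucination augmentations described above operate on the same inputs (captions and frames) but replace this random perturbation with generated content; their prompting and generation details are beyond the scope of this abstract.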