Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov

From arXiv; CVPR 2024. Project Page: https://snap-research.github.io/Panda-70M

The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video descriptions, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.
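
The caption-selection step described above can be illustrated with a minimal sketch. The abstract does not say how the finetuned retrieval model scores a caption against a video, so this example assumes the model yields one embedding vector per video and per candidate caption and ranks candidates by cosine similarity. The function names (`cosine_similarity`, `select_best_caption`), the teacher labels, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(
    video_embedding: Sequence[float],
    captions: List[str],
    caption_embeddings: List[Sequence[float]],
) -> Tuple[str, float]:
    """Return the candidate caption closest to the video, with its score.

    Assumption: the embeddings come from some video-text retrieval model;
    here they are plain lists of floats supplied by the caller.
    """
    if not captions or len(captions) != len(caption_embeddings):
        raise ValueError("need one embedding per candidate caption")
    scores = [cosine_similarity(video_embedding, e) for e in caption_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return captions[best], scores[best]


if __name__ == "__main__":
    # Toy data: three candidate captions from hypothetical teacher models.
    video = [0.9, 0.1, 0.0]
    candidates = ["caption from teacher A", "caption from teacher B", "caption from teacher C"]
    embeddings = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    caption, score = select_best_caption(video, candidates, embeddings)
    print(caption, round(score, 3))
```

In this sketch each clip keeps exactly one caption, the highest-scoring candidate, which mirrors the "select the best caption as the annotation" step; a real pipeline would replace the toy vectors with the outputs of the finetuned retrieval model.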
