The quality of the data and its annotations upper-bounds the quality of a downstream model. While large text corpora and image-text pair datasets exist, high-quality video-text data is much harder to collect. First, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension: a single video consists of several scenes stacked together and shows multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach that leverages multimodal inputs, such as textual video descriptions, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset in which the best caption of each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation. In this way, we obtain 70M videos paired with high-quality text captions. We dub the dataset Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. Models trained on the proposed data score substantially better on the majority of metrics across all three tasks.
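The caption-selection step described above can be illustrated with a minimal sketch. It assumes that a retrieval model has already embedded the video and each candidate caption into a shared vector space, and that the best caption is the one with the highest cosine similarity to the video. The function names, the teacher names, and the toy two-dimensional embeddings are hypothetical and serve only as illustration; they are not the implementation used to build Panda-70M.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(
    video_emb: Sequence[float],
    candidates: Dict[str, Tuple[str, Sequence[float]]],
) -> Tuple[str, str, float]:
    """Return (teacher, caption, score) for the candidate most similar to the video.

    `candidates` maps a hypothetical teacher name to (caption text, caption embedding).
    In practice the embeddings would come from the finetuned retrieval model.
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    best_teacher = max(candidates, key=lambda t: cosine(video_emb, candidates[t][1]))
    caption, emb = candidates[best_teacher]
    return best_teacher, caption, cosine(video_emb, emb)


# Toy example: hand-written embeddings stand in for real model outputs.
video_embedding = [1.0, 0.0]
candidate_captions = {
    "teacher_a": ("a dog runs across a grassy field", [0.9, 0.1]),
    "teacher_b": ("a person talks to the camera", [0.1, 0.9]),
    "teacher_c": ("an animal is outdoors", [0.6, 0.4]),
}
teacher, caption, score = select_best_caption(video_embedding, candidate_captions)
print(teacher, caption, round(score, 3))
```

In this sketch the selection is a plain argmax over similarity scores, so adding or removing a teacher model only changes the entries in the candidate dictionary.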