Recently, video-language understanding has achieved great success through large-scale pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to use synthetic annotations to refine large-scale, diverse ASR datasets that suffer from low quality. These methods successfully leverage useful information in multimodal video content (frames, tags, ASR transcripts, etc.) to refine the original annotations. Nevertheless, they struggle to mitigate noise within the synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control. For iterative refinement, we first leverage a video-language model to generate synthetic annotations, resulting in a refined dataset. We then pre-train on this dataset and fine-tune on human refinement examples to obtain a stronger model. These steps are repeated for continuous improvement. For noise control, we present AdaTaiLr, a novel method that requires weaker assumptions on the noise distribution and, with theoretical guarantees, proves more effective on large datasets. The combination of iterative refinement and AdaTaiLr achieves better scalability in video-language understanding. Extensive experiments show that our framework outperforms existing data refinement baselines, delivering a 3% performance boost and improving dataset quality with minimal diversity loss. Furthermore, our refined dataset facilitates significant improvements in various video-language understanding tasks, including video question answering and text-video retrieval.
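To make the iterative refinement loop concrete, the following is a minimal sketch of the data-flywheel procedure described above, written under stated assumptions: all function and method names (generate_annotation, pretrain, finetune) are illustrative placeholders, not the authors' actual implementation, and noise control via AdaTaiLr is only referenced in a comment.

```python
# Hypothetical sketch of the iterative refinement ("data flywheel") loop
# described in the abstract. All names are illustrative assumptions.

def data_flywheel(model, dataset, human_refined_examples, num_rounds=3):
    """Iteratively refine video annotations and strengthen the model."""
    for _ in range(num_rounds):
        # 1. Use the current video-language model to produce synthetic
        #    annotations, yielding a refined version of the dataset.
        refined_dataset = [
            (video, model.generate_annotation(video)) for video, _ in dataset
        ]

        # 2. Pre-train on the refined dataset; noise in the synthetic
        #    annotations would be handled by a robust objective such as
        #    the proposed AdaTaiLr (details omitted here).
        model.pretrain(refined_dataset)

        # 3. Fine-tune on a small set of human refinement examples to
        #    obtain a stronger model for the next round.
        model.finetune(human_refined_examples)

        # 4. The refined dataset becomes the input of the next iteration.
        dataset = refined_dataset

    return model, dataset
```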