Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.