Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, treating audio as weakly related information. They usually overlook the potential of the inherent audio-visual correlation, leading to monotonous annotation within each modality instead of comprehensive and precise descriptions. Such neglect makes many cross-modality studies difficult. To fill this gap, we present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions and 2M high-quality clips with multimodal captions. Trailers preview full-length video works and integrate context, visual frames, and background music. In particular, trailers have two main advantages: (1) their topics are diverse and their content spans various genres, e.g., film, news, and gaming; (2) the corresponding background music is custom-designed, making it more coherent with the visual context. Based on these insights, we propose a systematic captioning framework that produces annotations for multiple modalities across more than 27.1k hours of trailer videos. Here, to ensure that the captions retain the music perspective while preserving the authority of the visual context, we leverage an advanced LLM to merge all annotations adaptively. In this fashion, our MMTrail dataset potentially paves the way for fine-grained training of large multimodal-language models. In experiments, we provide evaluation metrics and benchmark results on our dataset, demonstrating the high quality of our annotations and their effectiveness for model training.
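To make the LLM-based merging step concrete, the following is a minimal sketch of how per-modality annotations could be fused into one multimodal caption. The `call_llm` helper and the prompt wording are hypothetical illustrations under our own assumptions, not the authors' actual pipeline or prompt template.

```python
# Hypothetical sketch of the LLM-based caption-merging step described above.
# `call_llm` is a placeholder for whichever chat-completion API is used;
# the prompt wording is illustrative, not the paper's actual template.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("wire this to your LLM provider")

def merge_captions(visual_caption: str, music_caption: str) -> str:
    """Fuse per-modality captions of one trailer clip into a single caption.

    The visual caption is treated as authoritative for scene content,
    while the music caption contributes the audio perspective.
    """
    prompt = (
        "Merge the following annotations of one trailer clip into a single "
        "coherent caption. Keep the visual description authoritative and "
        "weave in the music description where it fits naturally.\n\n"
        f"Visual caption: {visual_caption}\n"
        f"Music caption: {music_caption}\n"
    )
    return call_llm(prompt)
```

The key design choice this sketch reflects is the asymmetry stated in the abstract: the merge prompt instructs the model to preserve the authority of the visual description and only adaptively blend in the music perspective.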