Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, treating audio as weakly related information. They usually overlook the potential of the inherent audio-visual correlation, leading to monotonous annotation within each modality instead of comprehensive and precise descriptions. Such neglect makes many cross-modality studies difficult. To fill this gap, we present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions and 2M high-quality clips with multimodal captions. Trailers preview full-length video works and integrate context, visual frames, and background music. In particular, trailers have two main advantages: (1) their topics are diverse and their content spans various genres, e.g., film, news, and gaming; (2) the corresponding background music is custom-designed, making it more coherent with the visual context. Based on these insights, we propose a systematic captioning framework that produces annotations for multiple modalities across more than 27.1k hours of trailer videos. Here, to ensure that the captions retain the music perspective while preserving the authority of the visual context, we leverage an advanced LLM to merge all annotations adaptively. In this fashion, our MMTrail dataset potentially paves the way for fine-grained training of large multimodal-language models. In experiments, we provide evaluation metrics and benchmark results on our dataset, demonstrating the high quality of our annotations and their effectiveness for model training.
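To make the LLM-based merging step concrete, the following is a minimal sketch of how per-modality annotations could be fused into one multimodal caption. The `call_llm` helper and the prompt wording are hypothetical illustrations under our own assumptions, not the authors' actual pipeline or prompt template.

```python
# Hypothetical sketch of the LLM-based caption-merging step described above.
# `call_llm` is a placeholder for whichever chat-completion API is used;
# the prompt wording is illustrative, not the paper's actual template.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("wire this to your LLM provider")

def merge_captions(visual_caption: str, music_caption: str) -> str:
    """Fuse per-modality captions of one trailer clip into a single caption.

    The visual caption is treated as authoritative for scene content,
    while the music caption contributes the audio perspective.
    """
    prompt = (
        "Merge the following annotations of one trailer clip into a single "
        "coherent caption. Keep the visual description authoritative and "
        "weave in the music description where it fits naturally.\n\n"
        f"Visual caption: {visual_caption}\n"
        f"Music caption: {music_caption}\n"
    )
    return call_llm(prompt)
```

The key design choice this sketch reflects is the asymmetry stated in the abstract: the merge prompt instructs the model to preserve the authority of the visual description and only adaptively blend in the music perspective.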