The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.