The development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips, but they often fall short on longer formats such as movies. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. To address these challenges, we propose MovieLLM, a novel framework designed to synthesize consistent, high-quality video data for instruction tuning. The pipeline controls the style of the generated videos by improving the textual inversion technique with the powerful text generation capability of GPT-4. As the first framework of its kind, our approach stands out for its flexibility and scalability, empowering users to create customized movies from a single description. This makes it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the scarcity and bias of existing datasets.