Recent years have witnessed the success of Multimodal Large Language Models (MLLMs) in the vision understanding domain. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has mainly been powered by automatic data pipelines, which center on self-instruction with LLMs. This paradigm has long been taken for granted, yet the effectiveness of scaling with such data has remained largely unexamined. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our main study approach is fine-tuning pre-trained image-LLMs with video data and investigating learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which our probing attributes to a lack of instruction diversity. To address this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable or even superior to baselines trained with many more samples. Meanwhile, we find that incorporating these synthetic samples boosts long video understanding without training on long video data. The code and data examples are available at https://github.com/VITA-MLLM/Sparrow.