Recent years have witnessed the success of Multimodal Large Language Models (MLLMs) in the vision understanding domain. Their success is largely attributed to the prevailing scaling law, which holds that larger parameter counts and data volumes lead to better performance. Notably, data scaling has mainly been driven by automatic data pipelines, which center on the self-instruction of LLMs. This paradigm has long been taken for granted, yet the effectiveness of scaling with such data has remained largely unexamined. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our main approach is to fine-tune pre-trained image-LLMs with video data and to investigate learning efficiency through data scaling. Our preliminary experiments reveal low learning efficiency when simply scaling up the number of video samples, which our probing attributes to a lack of instruction diversity. To address this issue, we propose a data augmentation method, Sparrow, which synthesizes video-like samples from pure-text instruction data. Mixing these synthetic samples with real video data enables a more efficient training scheme. Comprehensive experiments demonstrate that our method achieves performance comparable to, or even better than, that of baselines trained with far more samples. We also find that incorporating these synthetic samples boosts long video understanding without training on long video data. The code and data examples are available at https://github.com/VITA-MLLM/Sparrow.