Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models that co-designs the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, {\em i.e.}, long context extension and long video supervised fine-tuning. However, training on long videos is computationally and memory intensive. We therefore introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system, which efficiently parallelizes long video training and inference, enabling training with a 2M context length on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames in VILA from 8 to 2048, improving the long video captioning score from 2.00 to 3.26 (out of 5) and achieving 99.8% accuracy on a 6,000-frame (more than 1 million tokens) video needle-in-a-haystack task. LongVILA-7B also demonstrates strong accuracy on the VideoMME benchmark, reaching 61.8% with subtitles. Furthermore, MM-SP is 2.1x-5.7x faster than ring-style sequence parallelism and 1.1x-1.4x faster than Megatron with hybrid context and tensor parallelism, and it integrates seamlessly with Hugging Face Transformers.