Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models that co-designs the algorithm and the system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long-context extension and long video supervised fine-tuning. However, training on long videos is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system, which efficiently parallelizes long video training and inference, enabling 2M-context-length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, improving the long video captioning score from 2.00 to 3.26 (out of 5) and achieving 99.8% accuracy in a 6,000-frame (more than 1 million tokens) video needle-in-a-haystack task. LongVILA-7B demonstrates strong accuracy on the VideoMME benchmark, i.e., 61.8% with subtitles. In addition, MM-SP is 2.1x to 5.7x faster than ring-style sequence parallelism and 1.1x to 1.4x faster than Megatron with hybrid context and tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.
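To illustrate the core idea behind sequence parallelism referenced above, here is a minimal sketch (not the authors' MM-SP implementation): a multi-million-token sequence is split along the sequence dimension so each of N GPUs holds only one shard. The function names and the single-process list-based setup are assumptions for illustration only; a real system would use distributed collectives to exchange attention state across ranks.

```python
# Minimal conceptual sketch of sequence-dimension sharding (hypothetical
# helpers; not the MM-SP system itself). Each "GPU" holds one contiguous
# shard of the full token sequence.

def shard_sequence(tokens, num_gpus):
    """Split `tokens` into `num_gpus` contiguous, near-equal shards."""
    base, rem = divmod(len(tokens), num_gpus)
    shards, start = [], 0
    for rank in range(num_gpus):
        size = base + (1 if rank < rem else 0)  # early ranks absorb remainder
        shards.append(tokens[start:start + size])
        start += size
    return shards

def gather_sequence(shards):
    """Inverse of shard_sequence: concatenate shards back into one sequence."""
    return [t for shard in shards for t in shard]

# Example: a 10-token sequence split across 4 hypothetical GPUs.
seq = list(range(10))
shards = shard_sequence(seq, 4)
assert [len(s) for s in shards] == [3, 3, 2, 2]
assert gather_sequence(shards) == seq
```

Because each rank materializes activations only for its own shard, the per-GPU memory cost of a 2M-token context drops roughly in proportion to the number of GPUs, which is what makes training without gradient checkpointing feasible.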