Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models that spans system design, model training, and dataset development. On the system side, we introduce the first long-context Multi-Modal Sequence Parallelism (MM-SP) system, which enables long-context training and inference and supports 2M-token context-length training on 256 GPUs without gradient checkpointing. MM-SP is 2.1x - 5.7x faster than ring sequence parallelism and 1.1x - 1.4x faster than Megatron context parallelism combined with tensor parallelism in text-only settings; moreover, it integrates seamlessly with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, short supervised fine-tuning, context extension, and long supervised fine-tuning. On the dataset side, we construct large-scale visual-language pre-training datasets and long-video instruction-following datasets to support this multi-stage training process. LongVILA extends the number of frames VILA can process from 8 to 1024 and improves the long-video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5% accuracy on a 1,400-frame (274K-token context) needle-in-a-haystack test. On the VideoMME benchmark, LongVILA-8B shows consistent accuracy improvements on long videos as the number of input frames increases.
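The core idea behind sequence parallelism is to shard one long token sequence across devices so that each rank holds only a contiguous slice, making activation memory scale with N / world_size rather than N. The following is a minimal, framework-free sketch of that sharding step only (the abstract's MM-SP system involves much more, e.g. multi-modal load balancing and attention communication); the helper names `shard_sequence` and `gather_sequence` are hypothetical illustrations, not LongVILA APIs.

```python
# Illustrative sketch of contiguous sequence sharding, the basic operation
# underlying sequence parallelism. Function names are hypothetical.

def shard_sequence(tokens, world_size):
    """Split a long token sequence into contiguous, near-equal shards,
    one per device rank. Earlier ranks absorb the remainder tokens."""
    n = len(tokens)
    base, rem = divmod(n, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        shards.append(tokens[start:start + size])
        start += size
    return shards

def gather_sequence(shards):
    """Inverse of shard_sequence: reassemble the full sequence
    (analogous to an all-gather across ranks)."""
    return [t for shard in shards for t in shard]

# Example: a 10-token "sequence" split across 4 ranks.
seq = list(range(10))
shards = shard_sequence(seq, 4)
print([len(s) for s in shards])        # shard sizes per rank
assert gather_sequence(shards) == seq  # lossless round trip
```

In a real system, each shard would live on a different GPU and attention over the full sequence would require communication between ranks (e.g. ring-style key/value exchange), which is where the reported speed differences between parallelism schemes arise.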