We present the first comprehensive, large-scale study of training long-context vision-language models with context lengths up to 344K tokens, targeting long-document visual question answering with measured transfer to long-context text. While several strong such models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B- and 32B-parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Beyond this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts; (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance; (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning; and (iv) we extend the known text-to-visual long-context transfer to the reverse direction, showing that visual long-context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc that reduces erroneous and low-quality examples in the benchmark.