Visual Language Models (VLMs) demonstrate impressive capabilities in processing multimodal inputs, yet applications such as visual agents, which require handling multiple images and high-resolution videos, demand enhanced long-range modeling. Moreover, existing open-source VLMs lack systematic exploration into extending their context length, and commercial models often provide limited details. To tackle this, we aim to establish an effective solution that enhances the long-context performance of VLMs while preserving their capacities in short-context scenarios. Towards this goal, we make careful design choices through extensive experiments spanning data curation, context window extension, and context utilization: (1) we analyze data sources and length distributions to construct ETVLM, a data recipe that balances performance across scenarios; (2) we examine existing position-extension methods, identify their limitations, and propose M-RoPE++ as an enhanced approach; we also choose to solely instruction-tune the backbone with mixed-source data; (3) we discuss how to better utilize extended context windows and propose hybrid-resolution training. Built on the Qwen-VL series, we propose Giraffe, which is effectively extended to a 128K context length. Evaluated on extensive long-context VLM benchmarks such as VideoMME and Visual Haystacks, our Giraffe achieves state-of-the-art performance among similarly sized open-source long-context VLMs and is competitive with the commercial model GPT-4V. We will open-source the code, data, and models.
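The abstract names M-RoPE++ as an enhanced position-extension method but does not define it here. As background, below is a minimal sketch of the position-interpolation family of RoPE extensions that such methods build on. This is a generic illustration under stated assumptions, not the paper's M-RoPE++; the function names and the scale factor are hypothetical.

```python
import torch


def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/d).
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Pairs up feature dimensions for the 2D rotations used by RoPE.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(x: torch.Tensor, positions: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, head_dim).

    scale > 1 implements position interpolation: token positions are divided
    by the extension factor so that sequences longer than the pretraining
    window map back into the rotation angles seen during pretraining.
    """
    head_dim = x.shape[-1]
    inv_freq = rope_inv_freq(head_dim)
    angles = (positions.float() / scale)[:, None] * inv_freq[None, :]  # (seq, d/2)
    emb = torch.cat((angles, angles), dim=-1)                          # (seq, d)
    return x * emb.cos() + rotate_half(x) * emb.sin()


# Illustrative usage: if the backbone were pretrained on a 32K window
# (an assumed figure), scale=4 would map 128K positions into that range.
q = torch.randn(131072, 128)
q_rotated = apply_rope(q, torch.arange(131072), scale=4.0)
```

Linear interpolation of this kind is known to compress high-frequency positional information, which is one reason papers in this space propose refinements; the specific limitations M-RoPE++ addresses are detailed in the paper body, not here.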