DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, because the input layer has to handle a large number of additional tokens. This paper presents a new architecture for LMMs, DeepStack. Given the $N$ layers of the language and vision transformers in an LMM, we stack the visual tokens into $N$ groups and feed each group to its aligned transformer layer \textit{from bottom to top}. Surprisingly, this simple method greatly enhances the ability of LMMs to model interactions among visual tokens across layers, at minimal additional cost. We apply DeepStack to both the language and vision transformers in LMMs and validate the effectiveness of DeepStack LMMs with extensive empirical results. Using the same context length, our DeepStack models with 7B and 13B parameters surpass their counterparts by \textbf{2.7} and \textbf{2.9} points on average across \textbf{9} benchmarks, respectively. Using only one-fifth of the context length, DeepStack closely rivals the counterparts that use the full context length. These gains are particularly pronounced on high-resolution tasks, e.g., improvements of \textbf{4.2}, \textbf{11.0}, and \textbf{4.0} on TextVQA, DocVQA, and InfoVQA, respectively, compared to LLaVA-1.5-7B. We further apply DeepStack to the vision transformer layers, which brings a similar improvement of \textbf{3.8} on average compared with LLaVA-1.5-7B.
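The layer-wise stacking idea can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not say how the groups are formed or how a group enters its layer, so the strided split in `split_into_groups` and the residual addition at the visual positions in `deepstack_forward` are assumptions, and both function names are hypothetical. Layers are modelled as plain callables on lists of vectors.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def split_into_groups(tokens: Sequence[Vector], n_groups: int) -> List[List[Vector]]:
    """Split a flat list of visual tokens into n_groups equally sized groups.

    Assumption: a simple strided split; group g takes tokens g, g + n_groups, ...
    """
    if n_groups <= 0 or len(tokens) % n_groups != 0:
        raise ValueError("token count must be a positive multiple of n_groups")
    return [list(tokens[g::n_groups]) for g in range(n_groups)]


def deepstack_forward(
    layers: Sequence[Layer],
    text_tokens: Sequence[Vector],
    visual_groups: Sequence[Sequence[Vector]],
) -> List[Vector]:
    """Run the layers bottom to top, feeding one visual group per layer.

    Group 0 forms the visual part of the input sequence. Group i (i > 0) is
    added element-wise to the hidden states at the visual positions just
    before layer i (assumed injection rule), so the sequence length stays at
    len(visual_groups[0]) + len(text_tokens) throughout.
    """
    if not visual_groups or len(visual_groups) > len(layers):
        raise ValueError("need between 1 and len(layers) visual groups")
    n_vis = len(visual_groups[0])
    hidden = [list(v) for v in visual_groups[0]] + [list(t) for t in text_tokens]
    for i, layer in enumerate(layers):
        if 0 < i < len(visual_groups):
            for pos in range(n_vis):
                hidden[pos] = [h + v for h, v in zip(hidden[pos], visual_groups[i][pos])]
        hidden = layer(hidden)
    return hidden


# Toy usage: four 1-d visual tokens, one text token, two identity "layers".
visual = [[1.0], [2.0], [3.0], [4.0]]
groups = split_into_groups(visual, n_groups=2)
identity: Layer = lambda h: h
out = deepstack_forward([identity, identity], [[10.0]], groups)
```

In this toy run the sequence holds three tokens instead of the five that would be needed if every visual token entered at the first layer, which is the source of the context-length saving described above.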
