Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Beyond directly scaling them up, these advances enable LLVMs to showcase powerful vision-language (VL) performance by covering diverse tasks via natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters) and carry a large number of layers, demanding costly, high-end resources for both training and inference. To address this issue, we present Traversal of Layers (TroL), a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes that enables the reuse of layers in a token-wise manner. This layer-traversing technique simulates the effect of looking back and retracing the answering stream, increasing the effective number of forward-propagation layers without physically adding more layers. We demonstrate that, despite employing a simple layer-traversing approach, TroL efficiently outperforms open-source LLVMs with larger model sizes and rivals the performance of closed-source LLVMs of substantial size.
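The core idea of token-wise layer reuse can be sketched as follows. This is a minimal illustrative simplification, not the authors' exact TroL architecture: a simple residual block stands in for a transformer layer, it is applied twice (the "traversal"), and a hypothetical sigmoid gate mixes the two passes per token so each token decides how much of the retraced computation to keep.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# One physical "layer": a residual MLP with fixed random weights
# (a stand-in for a transformer layer; weights are illustrative).
W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
def layer(x):
    return x + np.tanh(x @ W)

# Hypothetical token-wise gate: one scalar in (0, 1) per token.
g = rng.standard_normal(dim)
def gate(x):
    return 1.0 / (1.0 + np.exp(-(x @ g)))  # shape: (num_tokens,)

def traversal_block(x):
    h1 = layer(x)           # first pass through the physical layer
    h2 = layer(h1)          # traversal: reuse the SAME weights again
    w = gate(h1)[:, None]   # per-token mixing weight
    # Token-wise mixture: depth doubles without adding parameters.
    return w * h2 + (1 - w) * h1

tokens = rng.standard_normal((5, dim))  # 5 tokens of width `dim`
out = traversal_block(tokens)
assert out.shape == tokens.shape
```

The parameter count stays that of one layer, while the forward pass behaves like two stacked layers wherever the gate opens, which is the efficiency argument the abstract makes.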