Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., with slow responses and large latency. Recent efforts have been devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens they still use limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens while compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, with comprehensive comparisons against a set of tiny but strong MLLMs, e.g., InternVL2, MiniCPM-V2 and Qwen2-VL. The experimental results show that, compared with these advanced tiny MLLMs, our FlashSloth greatly reduces the number of visual tokens, training memory and computational complexity while retaining high performance on various VL tasks.
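The abstract does not spell out how visual tokens are compressed. As a rough illustration of the general idea (not the paper's actual design), query-based cross-attention pooling is a common way to distill a large grid of visual tokens into a few compressed ones; all names, shapes, and dimensions below are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_visual_tokens(visual_tokens, query_tokens):
    """Cross-attention pooling (illustrative): a small set of queries
    (e.g., learnable or instruction-derived) attends over the full grid
    of visual tokens, yielding one compressed token per query."""
    d = visual_tokens.shape[-1]
    attn = softmax(query_tokens @ visual_tokens.T / np.sqrt(d))
    return attn @ visual_tokens

rng = np.random.default_rng(0)
visual = rng.standard_normal((576, 64))   # e.g., 24x24 patch tokens from a ViT
queries = rng.standard_normal((16, 64))   # 16 compression queries (hypothetical)
compressed = compress_visual_tokens(visual, queries)
print(compressed.shape)  # (16, 64): 576 visual tokens reduced to 16
```

Feeding 16 instead of 576 tokens into the LLM is what yields the savings in memory and computation that the abstract claims; FlashSloth's contribution is making those few tokens carry both saliency and instruction-related information.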