The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performance across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We address this pressing issue by introducing a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. Inspired by human attention patterns in Visual Question Answering (VQA) tasks, TRIM offers a fresh perspective on the selection and reduction of image tokens. TRIM has been extensively tested across 12 datasets, and the results demonstrate a significant reduction in computational overhead while maintaining a consistent level of performance. This research marks a significant step toward efficient MLLM development, promoting greater accessibility and sustainability of high-performing models.
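The abstract does not spell out the selection rule, so the following is only a minimal sketch of one plausible instantiation: rank image patch tokens by their CLIP similarity to the text query and keep the top fraction. The function name `select_tokens_by_clip_similarity`, the fixed `keep_ratio`, and the top-k criterion are illustrative assumptions, not the paper's actual algorithm.

```python
import torch
import torch.nn.functional as F

def select_tokens_by_clip_similarity(image_tokens, text_embedding, keep_ratio=0.25):
    """Keep the image tokens most similar to the text query.

    image_tokens:   (N, D) patch embeddings projected into the CLIP space
    text_embedding: (D,)   CLIP text embedding of the question
    keep_ratio:     fraction of tokens to retain (assumed fixed here)
    """
    # Cosine similarity between each image token and the text query.
    sims = F.cosine_similarity(image_tokens, text_embedding.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * image_tokens.size(0)))
    # Retain the k highest-scoring tokens, preserving their original spatial order.
    kept_idx = sims.topk(k).indices.sort().values
    return image_tokens[kept_idx], kept_idx

# Toy usage: 576 patch tokens (e.g., a ViT-L/14 encoder at 336px), dim 768.
tokens = torch.randn(576, 768)
query = torch.randn(768)
kept, idx = select_tokens_by_clip_similarity(tokens, query)
print(kept.shape)  # torch.Size([144, 768])
```

Under this reading, the reduced token sequence replaces the full set of image tokens fed to the language model, which is where the computational savings would come from; how TRIM actually determines the number of retained tokens is not specified in the abstract.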