Most vision-language models (VLMs) use a large language model (LLM) as the decoder, which generates response tokens sequentially through autoregression. The number of output tokens can therefore become the bottleneck of end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency of the different components of VLMs on simulated data. The experiments show that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. An empirical study on diverse real-world benchmarks confirms this observation: a large model can match or exceed the performance of a small model while using significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps the large model's responses short but transfers key reasoning tokens from the small model when necessary. A comparison on benchmark tasks demonstrates that reusing the reasoning tokens from small models helps the large model approach the performance it achieves with its own reasoning, which confirms the effectiveness of our proposal.
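To make the latency argument concrete, the following is a minimal sketch, assuming a simple additive cost model in which end-to-end latency is a one-off prefill cost plus a fixed cost per generated token. The cost model, the `ModelProfile` type, and every number below are hypothetical values chosen for illustration; none of them are measurements from this work.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical latency profile of a VLM (illustrative values only)."""
    name: str
    prefill_ms: float           # assumed one-off cost to encode image and prompt
    decode_ms_per_token: float  # assumed cost of generating one output token


def end_to_end_latency_ms(model: ModelProfile, num_output_tokens: int) -> float:
    """Assumed additive model: prefill cost plus per-token decoding cost."""
    return model.prefill_ms + model.decode_ms_per_token * num_output_tokens


# Assumption: the large model is 3x slower per token and in prefill.
small = ModelProfile("small", prefill_ms=100.0, decode_ms_per_token=10.0)
large = ModelProfile("large", prefill_ms=300.0, decode_ms_per_token=30.0)

# Assumption: the small model needs a long reasoning trace (200 tokens),
# while the large model answers directly (20 tokens).
small_latency = end_to_end_latency_ms(small, 200)
large_latency = end_to_end_latency_ms(large, 20)

print(f"{small.name}: {small_latency:.0f} ms, {large.name}: {large_latency:.0f} ms")
```

Under these assumed numbers the large model finishes sooner even though each of its tokens costs more, because the output length dominates the total. Whether this holds in practice depends on the measured per-component latencies and on how many tokens each model actually needs.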