Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high inference latency, due to the substantial compute the LLM requires to process the large number of input tokens (predominantly from the image). To reduce inference costs, one can either downsize the LLM or reduce the number of input image tokens; the latter has been the focus of many recent works on token compression. However, it is unclear what the optimal trade-off is, as both factors directly affect VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture how performance varies with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved by using the largest LLM that fits within the inference budget while minimizing the visual token count, often down to a single token. While the token reduction literature has mainly focused on maintaining base-model performance by modestly reducing the token count (e.g., $5$-$10\times$), our results indicate that the compute-optimal inference regime requires operating at even higher token compression ratios. Based on these insights, we take initial steps toward building approaches tailored for high-token-compression settings. Code is available at https://github.com/locuslab/llava-token-compression.