Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once and can flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and computational cost induced by the number of visual tokens facilitates future research to achieve the best of both worlds.
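To make the training scheme concrete, below is a minimal sketch of the Matryoshka-style query-token truncation described above: a query transformer holds M learnable latent queries, and each training step cross-attends only the first m of them to the frozen vision encoder's patch embeddings. The class name, tensor shapes, and the sampling schedule for m are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of Matryoshka query-token truncation (illustrative, not the official code).
import random
import torch
import torch.nn as nn


class MatryoshkaQueryTransformer(nn.Module):
    def __init__(self, dim: int, max_queries: int = 256, num_heads: int = 8):
        super().__init__()
        # M learnable latent query tokens; only a prefix of them is used per step.
        self.latent_queries = nn.Parameter(torch.randn(max_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_features: torch.Tensor, m: int) -> torch.Tensor:
        # vision_features: (batch, num_patches, dim) from the frozen vision encoder.
        batch = vision_features.size(0)
        queries = self.latent_queries[:m].unsqueeze(0).expand(batch, -1, -1)
        # Cross-attention compresses the patch embeddings into m visual tokens.
        tokens, _ = self.attn(queries, vision_features, vision_features)
        return tokens  # (batch, m, dim), fed to the language model


# One training step: sample m <= M and keep only the first m latent queries.
mqt = MatryoshkaQueryTransformer(dim=1024, max_queries=256)
vision_features = torch.randn(2, 576, 1024)          # e.g. 576 ViT patch embeddings
m = random.choice([2, 4, 8, 16, 32, 64, 128, 256])   # assumed sampling schedule
visual_tokens = mqt(vision_features, m)
print(visual_tokens.shape)                            # torch.Size([2, m, 1024])
```

At inference time the same trained model accepts any m up to the maximum, which is what enables trading accuracy against compute without retraining.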