Matryoshka Query Transformer for Large Vision-Language Models

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs struggle to adapt to varying computational constraints. This raises the question: can the number of visual tokens be made flexible to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), which can encode an image into m visual tokens at inference time, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M and train the model using only the first m latent query tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once and can then flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) costs only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and computational cost as the number of visual tokens varies facilitates future research to achieve the best of both worlds.
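The token-selection mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-head dot-product cross-attention with no learned projections, toy dimensions, and uniform sampling of m over 1..M (the abstract does not specify the sampling distribution). The names `cross_attend`, `encode_image`, and `sample_num_tokens` are hypothetical.

```python
import math
import random


def cross_attend(queries, features):
    """Single-head dot-product cross-attention (assumed simplification):
    each query independently attends over all visual features."""
    dim = len(features[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim)
                  for f in features]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)
        ])
    return outputs


def encode_image(latent_queries, features, m):
    """Compress the visual features into m tokens using only the first m queries."""
    return cross_attend(latent_queries[:m], features)


def sample_num_tokens(max_tokens, rng):
    """Draw m for one training step (uniform over 1..M is an assumption)."""
    return rng.randint(1, max_tokens)


rng = random.Random(0)
M, DIM, NUM_PATCHES = 8, 4, 12  # toy sizes, not the paper's 256 / 576
latent_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(M)]
features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# One simulated training step: keep the first m queries, discard the rest.
m = sample_num_tokens(M, rng)
tokens = encode_image(latent_queries, features, m)

# At inference time any budget up to M can be chosen, e.g. the full M.
full = encode_image(latent_queries, features, M)
```

In this simplified setting the queries do not interact with each other, so the m tokens produced under a small budget coincide with the first m tokens produced under the full budget. That prefix property is a feature of the sketch and may not hold for architectures in which the query tokens also attend to one another.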
