Current autoregressive Vision Language Models (VLMs) typically rely on a large number of visual tokens to represent images, which increases compute requirements, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages multiple levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens, in particular mask-based object tokens, at test time. This allows the number of visual tokens to be adapted during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.