Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of the most effective strategies to address this complexity, yet previous approaches fall short when applied to the more complex task-oriented segmentation (TOS), where the class of each image patch is not predefined but depends on the specific input task. This work introduces Vision-Language Guided Token Pruning (VLTP), a novel token pruning mechanism that accelerates ViT-based segmentation models, particularly for TOS guided by a multi-modal large language model (MLLM). We argue that a ViT does not need to process every image token through all of its layers; only the tokens relevant to the reasoning task are necessary. We design a new pruning decoder that takes both image tokens and vision-language guidance as input to predict the relevance of each image token to the task. Only image tokens with high relevance are passed to the deeper layers of the ViT. Experiments show that the VLTP framework reduces the computational cost of the ViT by approximately 25% without performance degradation, and by around 40% with only a 1% performance drop.
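The core idea of relevance-based pruning can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual decoder (which is learned and attention-based): here each image token is scored by dot-product similarity with a single vision-language guidance vector, and only the top-scoring fraction is forwarded to deeper layers.

```python
import numpy as np

def prune_tokens(image_tokens, guidance, keep_ratio=0.75):
    """Hypothetical sketch of guidance-driven token pruning.

    image_tokens: (num_tokens, dim) array of ViT token embeddings.
    guidance:     (dim,) vision-language guidance vector (assumed given,
                  e.g. derived from an MLLM in the real framework).
    keep_ratio:   fraction of tokens to keep for the deeper layers.
    """
    # Relevance score per token: similarity with the guidance embedding.
    scores = image_tokens @ guidance           # shape: (num_tokens,)
    k = max(1, int(len(scores) * keep_ratio))  # number of tokens to keep
    keep_idx = np.argsort(scores)[-k:]         # indices of most relevant tokens
    keep_idx.sort()                            # preserve original token order
    return image_tokens[keep_idx], keep_idx

# Toy usage: 8 tokens of dimension 4, keep 50% of them.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
guidance = rng.normal(size=4)
kept, idx = prune_tokens(tokens, guidance, keep_ratio=0.5)
```

In the real VLTP framework the relevance predictor is a trained pruning decoder rather than a fixed dot product, but the pruning step itself works the same way: compute a task-conditioned relevance score per token, then pass only the high-relevance subset onward.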