In response to the rising interest in large multimodal models, we introduce Cross-Attention Token Pruning (CATP), a precision-focused token pruning method. Our approach leverages the cross-attention layers of multimodal models, exemplified by BLIP-2, to extract the information needed to determine token importance. CATP employs a refined voting strategy across model heads and layers. In evaluations, CATP achieves accuracy up to 12.1× higher than existing token pruning methods, addressing the trade-off between computational efficiency and model precision.
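To make the high-level idea concrete, the following is a minimal sketch of cross-attention-based token importance with voting aggregated over heads and layers. The function name `catp_token_importance`, the rank-weighted voting scheme, and the averaging over query tokens are illustrative assumptions; the abstract only states that a refined voting strategy across heads and layers is used, so this is not the authors' exact method.

```python
import torch


def catp_token_importance(cross_attn, keep_ratio=0.5):
    """Score visual tokens via cross-attention votes aggregated over layers and heads.

    cross_attn: list of tensors, one per layer, each of shape
        (num_heads, num_query_tokens, num_visual_tokens) -- attention from
        query tokens (e.g., BLIP-2 Q-Former queries) to visual tokens.
    Returns the indices of the visual tokens to keep.
    """
    num_tokens = cross_attn[0].shape[-1]
    votes = torch.zeros(num_tokens)
    for layer_attn in cross_attn:
        for head_attn in layer_attn:              # (num_query, num_tokens)
            # Average attention each visual token receives across query tokens.
            token_score = head_attn.mean(dim=0)
            ranking = token_score.argsort(descending=True)
            # Rank-based voting (an assumption): higher-ranked tokens get more votes.
            votes[ranking] += torch.arange(num_tokens, 0, -1, dtype=votes.dtype)
    num_keep = max(1, int(num_tokens * keep_ratio))
    return votes.topk(num_keep).indices


if __name__ == "__main__":
    # Toy example: 2 layers, 4 heads, 32 query tokens, 257 visual tokens.
    maps = [torch.rand(4, 32, 257).softmax(dim=-1) for _ in range(2)]
    kept = catp_token_importance(maps, keep_ratio=0.5)
    print(kept.shape)  # torch.Size([128])
```

The kept indices can then be used to gather the surviving visual tokens before they are passed to the downstream language model, reducing sequence length while retaining the tokens the cross-attention layers deem most informative.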