Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational cost poses a significant barrier to wider application. To enhance inference efficiency, most existing approaches rely on parameter-dependent or token-dependent strategies to reduce computational demands. However, these methods typically require complex training procedures and struggle to consistently select the most relevant tokens. In this paper, we systematically analyze these challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Unlike previous works, PAR develops a meta-router that adaptively organizes pruning flows across both tokens and layers. Trained in a self-supervised manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios. The code for this work will be made publicly available.
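To make the idea of routing pruning decisions across tokens and layers concrete, here is a minimal toy sketch. It is not the PAR method itself (the abstract gives no architectural details): the per-layer keep ratios and the L2-norm importance score are placeholder assumptions standing in for the learned meta-router.

```python
import numpy as np

def route_and_prune(tokens, layer_keep_ratios):
    """Toy illustration of layer-wise token pruning.

    tokens: (n, d) array of token embeddings.
    layer_keep_ratios: fraction of tokens retained at each layer.
    The L2 norm of each embedding serves as a stand-in importance
    score; a learned router would replace this heuristic.
    """
    kept = tokens
    for ratio in layer_keep_ratios:
        scores = np.linalg.norm(kept, axis=1)          # importance proxy
        k = max(1, int(round(ratio * kept.shape[0])))  # tokens to keep
        top = np.argsort(scores)[-k:]                  # highest-scoring tokens
        kept = kept[np.sort(top)]                      # preserve token order
    return kept

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))   # e.g., 64 visual tokens, 16-dim
pruned = route_and_prune(tokens, layer_keep_ratios=[0.75, 0.5, 0.5])
print(pruned.shape[0])  # 64 -> 48 -> 24 -> 12 tokens survive
```

The sketch shows why layer-level scheduling matters: the same total compute budget can be spent aggressively early or spread across depth, which is the kind of trade-off an adaptive router would learn to navigate.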