Despite achieving remarkable performance on a range of vision-language tasks, Transformer-based pretrained vision-language models (VLMs) still suffer from efficiency issues arising from long inputs and large parameter counts, which limits their real-world application. Much of this computation is redundant for most samples, however, and both the degree of redundancy and the components in which it occurs vary significantly across tasks and input instances. In this work, we propose SmartTrim, an adaptive acceleration method for VLMs that adjusts the inference overhead according to the complexity of each instance. Specifically, SmartTrim incorporates lightweight trimming modules into the backbone to perform task-specific pruning of redundant inputs and parameters, without the need for additional pre-training or data augmentation. Since visual and textual representations complement each other in VLMs, we propose to leverage cross-modal interaction information to provide more informative semantic guidance for identifying redundant parts. In addition, we introduce a self-distillation strategy that encourages the trimmed model to be consistent with the full-capacity model, which yields further performance gains. Experimental results demonstrate that SmartTrim reduces the computation overhead of various VLMs by a factor of 2-3 while maintaining comparable performance (only a 1-2% degradation) on a range of vision-language tasks. Compared to previous acceleration methods, SmartTrim attains a better efficiency-performance trade-off, demonstrating great potential for application in resource-constrained scenarios.
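The two ideas named in the abstract, instance-dependent trimming guided by the other modality and a consistency objective between the trimmed and full-capacity models, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `trim_tokens` and `self_distillation_loss`, the cosine-similarity score, the fixed threshold, and the KL-divergence objective are hypothetical stand-ins. The abstract only states that SmartTrim uses lightweight trimming modules and a self-distillation strategy, and does not specify their form.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def trim_tokens(image_tokens, text_summary, threshold=0.5):
    """Hypothetical cross-modal trimming: keep image tokens whose cosine
    similarity to a pooled text vector reaches the threshold.
    The number of kept tokens therefore varies per instance."""
    kept = [tok for tok in image_tokens if cosine(tok, text_summary) >= threshold]
    return kept or image_tokens[:1]  # assumption: never drop every token

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def self_distillation_loss(full_logits, trimmed_logits, temperature=1.0):
    """Assumed consistency objective: KL(full || trimmed) over class
    probabilities, with the full-capacity model acting as the teacher."""
    p = softmax(full_logits, temperature)
    q = softmax(trimmed_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy usage: three 2-d image tokens and one pooled text vector.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_summary = [1.0, 0.0]
kept = trim_tokens(image_tokens, text_summary)
loss = self_distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8])
```

Because trimming here is threshold-based rather than a fixed top-k, an input whose tokens mostly align with the text keeps more of them and costs more to process, while a simpler input is pruned harder. That is the instance-adaptive behavior the abstract describes, shown only in toy form.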