Vision-Language Transformers (VLTs) have recently shown great success, but they incur heavy computation costs, largely because of the large number of visual and language tokens they process. Existing token pruning methods for compressing VLTs mainly follow a single-modality scheme and ignore the critical role that cross-modal alignment plays in guiding the pruning process, so tokens that are important to one modality can be falsely pruned in the other modality's branch. Existing VLT pruning methods also lack the flexibility to compress each layer dynamically according to the input sample. To this end, we propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs. Specifically, we first introduce a Multi-modality Alignment Guidance (MAG) module that aligns features of the same semantic concept across modalities, ensuring that the pruned tokens are less important for all modalities. We further design a Dynamic Token Pruning (DTP) module that adaptively adjusts the token compression ratio in each layer for each input instance. Extensive experiments on various benchmarks demonstrate that MADTP significantly reduces the computational complexity of various multimodal models while preserving competitive performance. Notably, when applied to the BLIP model on the NLVR2 dataset, MADTP reduces GFLOPs by 80% with less than 4% performance degradation.
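To make the idea of alignment-guided, input-dependent pruning concrete, the following is a minimal sketch in plain Python. It is not the MADTP implementation: the function name `prune_tokens`, the averaging of the two scores, the fixed `threshold`, and the `min_keep` fallback are all assumptions chosen for illustration. It only shows two properties the abstract describes: a token's cross-modal relevance can protect it from pruning, and the number of surviving tokens depends on the input rather than on a fixed ratio.

```python
from typing import List, Sequence


def prune_tokens(
    tokens: Sequence[str],
    own_scores: Sequence[float],
    cross_scores: Sequence[float],
    threshold: float = 0.5,
    min_keep: int = 1,
) -> List[str]:
    """Keep tokens whose combined importance reaches `threshold`.

    Illustrative only: the averaging rule, the threshold, and `min_keep`
    are assumptions of this sketch, not the MADTP formulation.
    """
    if not (len(tokens) == len(own_scores) == len(cross_scores)):
        raise ValueError("tokens and score lists must have equal length")

    # Combine single-modality importance with a cross-modal alignment score,
    # so a token that matters to the other modality is less likely to be dropped.
    combined = [(o + c) / 2.0 for o, c in zip(own_scores, cross_scores)]

    # Threshold-based selection: how many tokens survive depends on the input.
    keep = [i for i, score in enumerate(combined) if score >= threshold]

    # Fall back to the top-scoring tokens if the threshold removes too many.
    if len(keep) < min_keep:
        ranked = sorted(range(len(tokens)), key=lambda i: combined[i], reverse=True)
        keep = sorted(ranked[:min_keep])

    return [tokens[i] for i in keep]
```

As a usage example with made-up scores, calling `prune_tokens(["sky", "dog", "grass", "ball"], [0.25, 0.75, 0.25, 0.25], [0.25, 0.75, 0.25, 1.0])` keeps `"dog"` and `"ball"`. The `"ball"` token has a low single-modality score but survives because of its high cross-modal score, which is the failure case a single-modality scheme would miss.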