Vision-Language Large Models (VLMs) have become a primary backbone of AI owing to their impressive performance. However, their high computational cost, reflected in low throughput and high latency, limits their potential in real-world scenarios. To accelerate VLMs, most existing methods take a model perspective (pruning, distillation, quantization) but completely overlook redundancy on the data side. To fill this gap, this paper is the first to highlight the severity of data redundancy, and designs a plug-and-play Turbo module, guided by information degree, that prunes inefficient tokens from visual or textual data. In pursuit of an efficiency-performance trade-off, information degree takes two key factors into account: mutual redundancy and semantic value. Concretely, the former measures the duplication between sequential tokens, while the latter scores each token by its contribution to the overall semantics. As a result, tokens with a high information degree carry less redundancy and stronger semantics. During VLM computation, Turbo works as a user-friendly plug-in that sorts tokens by information degree and keeps only the top-ranked ones to save cost. Its advantages are multifaceted: it is broadly compatible with various VLMs across understanding and generation tasks, and it is simple to use, requiring no retraining and only trivial engineering effort. Extensive experiments on multiple public VLM benchmarks show that Turbo delivers substantial acceleration with a negligible performance drop.
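To make the ranking-and-pruning idea concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not give formulas, so every detail here is an assumption: semantic value is approximated by cosine similarity to the mean token embedding, mutual redundancy by cosine similarity to the preceding token, and the two are combined as `semantic - alpha * redundancy`. The names `information_degree`, `prune_tokens`, `alpha`, and `keep_ratio` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def information_degree(tokens: List[Vector], alpha: float = 1.0) -> List[float]:
    """Hypothetical score per token: semantic value minus alpha * redundancy."""
    dim = len(tokens[0])
    # Assumed proxy for "overall semantics": the mean token embedding.
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    scores = []
    for i, tok in enumerate(tokens):
        semantic = cosine(tok, mean)
        # Assumed proxy for mutual redundancy: similarity to the previous token.
        redundancy = cosine(tok, tokens[i - 1]) if i > 0 else 0.0
        scores.append(semantic - alpha * redundancy)
    return scores


def prune_tokens(tokens: List[Vector], keep_ratio: float = 0.5,
                 alpha: float = 1.0) -> List[int]:
    """Return indices of the kept tokens, preserving their original order."""
    if not tokens:
        return []
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    scores = information_degree(tokens, alpha)
    # Rank by score, highest first; Python's sort is stable, so ties keep order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Two pairs of duplicated tokens: the second copy of each pair is redundant.
    toy_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(prune_tokens(toy_tokens, keep_ratio=0.5))
```

In this toy input the second token of each duplicated pair receives a lower score because it repeats its predecessor, so only the first token of each pair is kept. A real plug-in would operate on batched tensors inside the model and might merge rather than drop tokens; those choices are outside what the abstract specifies.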