Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens. It is an important but challenging task. Although recent approaches have achieved great success, they require a carefully handcrafted compression rate (i.e. the number of tokens to remove), which is tedious to tune and leads to sub-optimal performance. To tackle this problem, we propose Differentiable Compression Rate (DiffRate), a novel token compression method with several appealing properties that prior art lacks. First, DiffRate propagates the gradient of the loss function onto the compression rate, which previous work treats as a non-differentiable hyperparameter. As a result, each layer can automatically learn its own compression rate without extra overhead. Second, token pruning and merging are naturally performed simultaneously in DiffRate, whereas previous work treats them in isolation. Third, extensive experiments demonstrate that DiffRate achieves state-of-the-art performance. For example, by applying the learned layer-wise compression rates to an off-the-shelf ViT-H (MAE) model, we achieve a 40% FLOPs reduction and a 1.5x throughput improvement with a minor accuracy drop of 0.16% on ImageNet without fine-tuning, even outperforming previous methods that use fine-tuning. Code and models are available at https://github.com/OpenGVLab/DiffRate.
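The abstract does not spell out how a discrete compression rate becomes differentiable. The sketch below shows one generic relaxation and should not be read as the DiffRate implementation: it assumes a learnable logit per candidate kept-token count, so that the expected count and a soft keep mask over importance-ranked tokens are both smooth functions of the logits. All names (`keep_probabilities`, `expected_kept`, `grad_expected_kept`) and the candidate values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_kept(logits, candidates):
    """Expected number of kept tokens, E[k] = sum_j p_j * k_j."""
    probs = softmax(logits)
    return sum(p * k for p, k in zip(probs, candidates))


def keep_probabilities(logits, candidates, num_tokens):
    """Soft keep mask over tokens sorted by importance (rank 0 = most important).

    The token at rank i is kept with probability P(k > i), i.e. the total
    probability of all candidate counts that would retain it.
    """
    probs = softmax(logits)
    return [
        sum(p for p, k in zip(probs, candidates) if k > i)
        for i in range(num_tokens)
    ]


def grad_expected_kept(logits, candidates):
    """Analytic gradient dE[k]/dlogit_j = p_j * (k_j - E[k])."""
    probs = softmax(logits)
    mean = sum(p * k for p, k in zip(probs, candidates))
    return [p * (k - mean) for p, k in zip(probs, candidates)]


if __name__ == "__main__":
    candidates = [4, 8, 12, 16]    # assumed candidate kept-token counts
    logits = [0.0, 0.0, 0.0, 0.0]  # assumed learnable scores, uniform at start
    print(expected_kept(logits, candidates))
    print(keep_probabilities(logits, candidates, num_tokens=16))
    print(grad_expected_kept(logits, candidates))
```

With uniform logits over the candidates 4, 8, 12 and 16, the expected kept count is 10, and the soft mask sums to the same value. Because the gradient with respect to each logit is nonzero, a loss that combines task error with a compute budget could, under these assumptions, adjust the rate of each layer by gradient descent.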