Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budgets. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts, and that the degree of this shift correlates strongly with the performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric that estimates the distribution shift between the retained tokens and the full token set. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when a severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budgets. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.
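As a rough illustration of what a lightweight distribution consistency check could look like, the sketch below compares per-dimension mean and standard deviation of the retained tokens against those of the full token set. This is an assumption made for illustration only: the abstract does not specify the metric's form, and the function name `distribution_shift`, its arguments, and the normalization are hypothetical.

```python
import math
from statistics import fmean, pstdev


def distribution_shift(full_tokens, keep_idx):
    """Hypothetical shift score between retained and full token features.

    full_tokens: list of equal-length feature vectors (one per visual token).
    keep_idx:    indices of the tokens that survive pruning.
    Returns 0.0 when the retained set has the same per-dimension mean and
    standard deviation as the full set; larger values mean a larger shift.
    """
    kept = [full_tokens[i] for i in keep_idx]
    dims = range(len(full_tokens[0]))

    def stats(tokens):
        # Per-dimension mean and population standard deviation.
        mean = [fmean(t[d] for t in tokens) for d in dims]
        std = [pstdev([t[d] for t in tokens]) for d in dims]
        return mean, std

    mean_full, std_full = stats(full_tokens)
    mean_kept, std_kept = stats(kept)

    # Euclidean gap in means plus gap in spreads, scaled by the full set's spread.
    scale = math.hypot(*std_full) + 1e-12
    return (math.dist(mean_full, mean_kept) + math.dist(std_full, std_kept)) / scale


# Toy usage: four 2-D "tokens"; keeping every token gives no shift,
# keeping a single corner token gives a clearly positive score.
tokens = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
no_shift = distribution_shift(tokens, [0, 1, 2, 3])
corner_shift = distribution_shift(tokens, [3])
```

In a pipeline of the kind the abstract describes, a score like this could be compared against a threshold to decide whether re-selection is needed; the threshold and the exact statistic would have to come from the method itself.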