Transformer inference often requires a large KV cache, especially for long-context language modeling and multimodal generation. Existing compression methods usually reduce cache cost by selecting, evicting, quantizing, or compressing cached tokens, or by reducing the visual-token sequence before language-model inference. We introduce sub-token routing, a KV-compression method that adds a finer control axis inside retained tokens. It splits each retained value vector into groups and keeps only selected groups, while leaving query and key states unchanged. The method is designed to work after token-level reduction. First, a token-reduction method determines which tokens are retained. Then, sub-token routing compresses the value states inside those retained tokens. Experiments under matched KV budgets show that adding sub-token routing improves token-level reduction performance in both LLM and VLM settings, including Quest on LLaMA-2-7B and Qwen2.5-7B, and FastV/VisionZip across LLaVA and Qwen-VL models. The gains are larger at smaller KV budgets, suggesting that value-group routing is especially useful when further token removal becomes costly. Overall, token-level reduction and sub-token routing provide complementary ways to reduce KV cost.
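To make the two-stage pipeline concrete, the following is a minimal NumPy sketch of value-group routing applied to tokens that a token-level method has already retained. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how groups are selected, so scoring each group by its L2 norm is a placeholder, and the names `route_value_groups` and `relative_kv_cost` are hypothetical. The cost helper further assumes that keys and values have the same size, that keys are stored in full, and that the overhead of storing group indices is negligible.

```python
import numpy as np


def route_value_groups(values, num_groups, keep_groups):
    """Keep the `keep_groups` highest-scoring groups of each value vector.

    values: array of shape (num_tokens, head_dim) holding the value states
            of tokens that survived token-level reduction.
    Returns the routed values (dropped groups zeroed) and a boolean mask of
    shape (num_tokens, num_groups) marking the kept groups.

    Assumption: groups are scored by L2 norm. The actual routing criterion
    is not given in the abstract.
    """
    num_tokens, head_dim = values.shape
    if head_dim % num_groups != 0:
        raise ValueError("head_dim must be divisible by num_groups")
    if not 0 < keep_groups <= num_groups:
        raise ValueError("keep_groups must be in 1..num_groups")

    # Split each value vector into contiguous groups of equal width.
    grouped = values.reshape(num_tokens, num_groups, head_dim // num_groups)

    # Score every group of every token (placeholder criterion).
    scores = np.linalg.norm(grouped, axis=-1)

    # Indices of the top-scoring groups for each token.
    top = np.argsort(-scores, axis=-1)[:, :keep_groups]
    mask = np.zeros((num_tokens, num_groups), dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)

    # Zeroing stands in for dropping: a real cache would store only the
    # kept groups plus their indices.
    routed = np.where(mask[:, :, None], grouped, 0.0)
    return routed.reshape(num_tokens, head_dim), mask


def relative_kv_cost(token_keep_ratio, value_keep_ratio):
    """KV cost relative to a full cache.

    Assumes keys and values are equally sized and keys are kept whole,
    so each retained token costs (1 + value_keep_ratio) / 2 of a full entry.
    """
    return token_keep_ratio * (1.0 + value_keep_ratio) / 2.0
```

Under these assumptions, retaining half of the tokens and half of the value groups in each gives `relative_kv_cost(0.5, 0.5) == 0.375`, the same budget as retaining 37.5% of tokens with full values. This is the kind of matched-budget comparison the abstract refers to: routing lets a method keep more tokens at a given KV cost by storing less of each value vector.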