KV caches, typically used only to accelerate autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Although weaker than dedicated embeddings, KV-derived representations prove sufficient for two key applications: \textbf{(i) Chain-of-Embedding}, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and \textbf{(ii) Fast/Slow Thinking Switching}, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to $5.7\times$ with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV-Embedding.
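The core idea can be sketched with a toy single-head attention layer: decoding already populates a KV cache, and pooling that cache yields a context representation for free. Everything below (the tiny hidden size, random weights, and the mean-pooling readout) is an illustrative assumption, not the paper's actual method or models.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size of the toy model (assumed for illustration)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def decode_step(x, cache):
    """One autoregressive step: cache this token's key/value, attend over the cache."""
    cache["k"].append(x @ Wk)
    cache["v"].append(x @ Wv)
    q = x @ Wq
    scores = np.stack(cache["k"]) @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ np.stack(cache["v"])

cache = {"k": [], "v": []}
for token in rng.standard_normal((5, d)):  # 5 pseudo-token embeddings
    _ = decode_step(token, cache)

# The cache, a byproduct of decoding, doubles as a context embedding:
kv_embedding = np.concatenate([
    np.stack(cache["k"]).mean(axis=0),  # mean-pooled keys
    np.stack(cache["v"]).mean(axis=0),  # mean-pooled values
])
print(kv_embedding.shape)  # prints (16,), i.e. 2 * d
```

No extra forward pass is needed: the keys and values were computed anyway during generation, which is what makes the representation "free" relative to recomputing full hidden states.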