Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Standard transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) -- far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch -- unlike GQA and MLA, which must be designed into the architecture before pretraining. We factorize each key projection $W_K \approx A_{d \times r} B_{r \times d}$ via truncated SVD (where $r = d_{\text{select}}$), set $W_K' = A$ as the new key projection producing compact $r$-dimensional keys for the cache, and absorb $B^\top$ into the query projection ($W_Q' = W_Q B^\top$) at zero cost -- since queries are never cached. At 7B scale, training from scratch with $r = d_{\text{model}}/4$ matches full-attention perplexity (9.2 vs 9.3 PPL after 20B tokens) while using 12% fewer parameters and training 8% faster. For existing models, SVD + QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at approximately 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to $16\times$ combined key cache compression. For a 7B model serving 128K context, factored keys save 25 GB of KV cache per user, enabling approximately 60% more concurrent users on identical hardware.
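The factorization and query absorption described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a single head, the row-vector convention $k = x W_K$, and a synthetic $W_K$ that is exactly rank $r$ so that truncation is lossless. The helper names `factor_keys` and `key_cache_bytes` are hypothetical. The sizing figures (32 layers, $d_{\text{model}} = 4096$, 2-byte elements, no GQA, 128K taken as $128 \times 1024$ tokens) are assumed values typical of a 7B model, not stated in the abstract.

```python
import numpy as np


def factor_keys(W_Q, W_K, r):
    """Truncated SVD of W_K ~= A @ B; returns (W_Q', W_K') = (W_Q @ B.T, A)."""
    U, S, Vt = np.linalg.svd(W_K, full_matrices=False)
    A = U[:, :r] * S[:r]   # d x r: new key projection, produces r-dim keys
    B = Vt[:r, :]          # r x d: absorbed into the query side
    return W_Q @ B.T, A


def key_cache_bytes(n_layers, seq_len, key_dim, bytes_per_elem):
    """Key-cache size for one sequence (assumed layout: one key vector per token per layer)."""
    return n_layers * seq_len * key_dim * bytes_per_elem


rng = np.random.default_rng(0)
d, r, n = 64, 16, 10
W_Q = rng.standard_normal((d, d))
# Assumption: W_K is exactly rank r, so the rank-r factorization loses nothing.
W_K = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
X = rng.standard_normal((n, d))   # n token embeddings

W_Q_new, W_K_new = factor_keys(W_Q, W_K, r)

scores_full = (X @ W_Q) @ (X @ W_K).T          # original attention logits
scores_thin = (X @ W_Q_new) @ (X @ W_K_new).T  # logits from r-dim queries and keys
keys_cached = X @ W_K_new                      # what the cache would store: n x r

# Sizing under the assumed 7B-style configuration.
full = key_cache_bytes(32, 128 * 1024, 4096, 2)
thin = key_cache_bytes(32, 128 * 1024, 4096 // 4, 2)
saved_gib = (full - thin) / 2**30
# Values stay full-width, so total KV per user is (keys + values).
users_ratio = (full + full) / (thin + full)
```

With an exactly low-rank $W_K$ the two score matrices agree to floating-point precision; for a real pretrained $W_K$ the truncation is only approximate, which is why the abstract pairs SVD with QK fine-tuning. Under the assumed configuration, `saved_gib` is 24 GiB (about 25.8 GB) and `users_ratio` is 1.6, in line with the 25 GB and 60% figures quoted above.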
