KV-cache quantization is usually framed as a quality--latency trade-off. We show that this trade-off is \emph{inverted} on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT $+$ per-channel $\lambda$ $+$ per-group abs-max $+$ int4 nibble pack), exposed as a HuggingFace \texttt{Cache} subclass, runs \emph{faster than fp16} across $256$--$4096$-token prefixes on Gemma-3 1B ($-3\%$ to $-8\%$ ms/tok) and at short context on Qwen2.5-1.5B ($-0.7\%$ to $-2.6\%$ ms/tok through $1$K tokens), with $3\times$ persistent memory compression and quality preserved ($\dPPL = 0.000$ on Qwen short prompts; hook $\dPPL = +3.6$ on Gemma). The kernel's $\sim\!25$\,ns/vec overhead is smaller than the memory-traffic time saved by the $3\times$ compression. The fused kernel also substantially narrows Qwen's 4-bit per-token catastrophe ($\dPPL = +7975 \to +638.6$, a $12.5\times$ reduction) while running at $182$\,GFLOPS for $D{=}128$. Supporting findings: $\SRFT$ and $\SRHT$ are statistically indistinguishable for KV quality (we pick $\SRFT$ for its mixed-radix support and matrix-multiply alignment); a learned-rotation ablation surfaces a regularization role for the fixed random SRFT base (learning $R+\lambda$ without SRFT lowers calibration MSE by $84.9\%$, versus $50.3\%$ with it, but yields worse PPL); Householder rotations with $k{=}d/2$ reflectors are effectively lossless at $d{=}256$.
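For orientation, the four fused stages can be written out per vector as the following sketch. The notation is assumed for illustration and is not taken from the kernel itself: $x \in \mathbb{R}^{d}$ is one key or value vector, $s \in \{\pm 1\}^{d}$ a fixed random sign vector, $\mathcal{F}$ an orthonormal real-valued FFT-based transform (the real/complex packing is left unspecified), $\lambda \in \mathbb{R}^{d}_{>0}$ the per-channel scale (assumed here to multiply), $G_j$ the $j$-th group of $g$ consecutive channels, and the symmetric int4 range $[-7, 7]$ with a $+8$ storage offset is likewise an assumption.
\[
  y = \lambda \odot \mathcal{F}(s \odot x), \qquad
  a_j = \max_{i \in G_j} |y_i|, \qquad
  q_i = \mathrm{clip}\bigl(\mathrm{round}(7\, y_i / a_j),\, -7,\, 7\bigr) \quad (i \in G_j),
\]
\[
  b_m = (q_{2m} + 8) + 16\,(q_{2m+1} + 8), \qquad m = 0, \dots, d/2 - 1,
\]
where each byte $b_m$ holds two 4-bit codes. Under the same assumptions, dequantization inverts the stages in reverse order:
\[
  \hat{y}_i = \frac{a_j\, q_i}{7} \quad (i \in G_j), \qquad
  \hat{x} = s \odot \mathcal{F}^{-1}\bigl(\hat{y} \oslash \lambda\bigr),
\]
with $\oslash$ denoting element-wise division. If each group scale $a_j$ is stored in $b_a$ bits, the cost is $4 + b_a/g$ bits per element, so the compression relative to fp16 is
\[
  \rho = \frac{16}{4 + b_a / g},
\]
which is why int4 storage with per-group scales compresses by somewhat less than $4\times$. As a purely illustrative setting (not a reported configuration), $b_a = 16$ and $g = 32$ give $\rho = 16/4.5 \approx 3.56$.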