We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices, without dequantization. The key enabler is a codebook fixed at design time: orthogonal-transform-based quantization drives each coordinate's distribution toward N(0, 1/d), so the optimal quantizer depends only on the dimension d and the bit-width b, not on the input data. The asymmetric path design (transform on write, table lookup on read, no inverse transform) reduces per-query multiplications by 102.4x (a mathematical identity). Through multi-seed evaluation (10 seeds x 3 models), we find that sensitivity to the sign pattern causes catastrophic perplexity (PPL) spikes (Delta > 50) on some models (Qwen2.5-3B), while others (LLaMA-3.1-8B) remain fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a gradient-free sign pattern selection procedure (200 candidates, 8 calibration samples, run once) that eliminates the catastrophic spikes at zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.
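The asymmetric write/read path can be illustrated with a minimal NumPy sketch. It is not taken from the AXELRAM repository: the function names (`hadamard`, `fixed_codebook`, `write_key`, `read_scores`), the choice of a sign-flipped Hadamard matrix as the orthogonal transform, the per-key norm stored alongside the indices, and the Lloyd iterations on synthetic Gaussian samples are all assumptions made for illustration. The sketch relies only on the fact that an orthogonal R preserves inner products, so a query can be transformed once, multiplied against the 2^b codebook levels to form a small table, and then scored against every stored key by lookup and addition.

```python
import numpy as np

def hadamard(d):
    """Orthonormal Sylvester Hadamard matrix (d must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def fixed_codebook(d, b, iters=30, n_samples=100_000):
    """Scalar codebook for N(0, 1/d), built from synthetic samples only.

    Assumption: plain Lloyd iterations stand in for the optimal quantizer.
    No model or input data is used, so the result depends only on d and b.
    """
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0 / np.sqrt(d), n_samples)
    levels = np.quantile(x, (np.arange(2 ** b) + 0.5) / 2 ** b)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2
        cell = np.searchsorted(edges, x)
        levels = np.array([x[cell == c].mean() for c in range(2 ** b)])
    return levels

def write_key(k, R, levels):
    """Write path: normalize, transform, store nearest-level indices."""
    norm = np.linalg.norm(k)
    z = R @ (k / norm)
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, z).astype(np.uint8), norm

def read_scores(q, R, levels, idx, norms):
    """Read path: one table per query, then lookups and additions per key."""
    zq = R @ q                       # transform the query once
    table = np.outer(zq, levels)     # d * 2^b multiplications in total
    gathered = table[np.arange(zq.size), idx.astype(np.intp)]  # (n_keys, d)
    return gathered.sum(axis=1) * norms  # one rescale per key (sketch choice)

# Hypothetical usage: d = 64, b = 4, 256 cached keys.
rng = np.random.default_rng(1)
d, b, n_keys = 64, 4, 256
signs = rng.choice([-1.0, 1.0], size=d)   # one candidate sign pattern
R = hadamard(d) * signs                   # orthogonal: Hadamard times diag(signs)
levels = fixed_codebook(d, b)
K, q = rng.normal(size=(n_keys, d)), rng.normal(size=d)
stored = [write_key(k, R, levels) for k in K]
idx = np.stack([s[0] for s in stored])
norms = np.array([s[1] for s in stored])
scores = read_scores(q, R, levels, idx, norms)
```

In this sketch the table costs d * 2^b multiplications per query regardless of how many keys are cached, whereas a dequantize-then-dot-product path would cost n_keys * d. How the abstract's 102.4x figure is counted is not reproduced here. The `signs` vector is where a sign pattern selection step would plug in: a different pattern gives a different R while the codebook stays the same.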