LLM inference is increasingly limited by memory bandwidth, and the bottleneck worsens at long context as the KV cache grows. CXL memory adds capacity for offloading weights and KV cache, but its link and device-side DDR bandwidth are far below that of HBM, so decoding stalls once traffic shifts to the CXL tier. Many CXL controllers are beginning to add generic \emph{lossless} compression, yet applying commodity codecs directly to standard word-major LLM tensors is largely ineffective, especially for token-major KV streams. We propose TRACE (\textbf{T}raffic-\textbf{R}educed \textbf{A}rchitecture for \textbf{C}ompression and \textbf{E}lasticity), which preserves the unmodified CXL.mem interface but changes the device-internal representation. TRACE stores tensors in a channel-major, disaggregated bit-plane layout and applies a KV-specific transform before compression, converting mixed-field words into low-entropy plane streams that commodity codecs can compress. The same substrate enables precision-proportional fetch by reading only the required bit-planes. Across public LLMs, TRACE losslessly reduces BF16 weight footprint by 25.2\% and BF16 KV footprint by 46.9\%, with per-layer KV ratios peaking at 2.69$\times$. In trace-driven system modeling, once KV spills to CXL, GPT-OSS-120B-MXFP4 improves throughput at 128k tokens from 16.28 to 68.99 tok/s (4.24$\times$). DRAMSim3 shows up to 40.3\% lower DRAM access energy under plane-aligned fetch. A 7\,nm SystemVerilog implementation sustains 256\,GB/s device bandwidth. Relative to a CXL controller with generic inline lossless compression, TRACE adds only 7.2\% area, 4.7\% power, and 6.0\% load-to-use latency at 2\,GHz and 0.7\,V.
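The bit-plane idea and precision-proportional fetch described above can be illustrated with a minimal Python sketch. It is not the TRACE implementation: the helper names (to_bf16, to_planes, from_planes) are hypothetical, zlib stands in for an unspecified commodity codec, the input is synthetic Gaussian data rather than real weights, and the channel-major grouping and KV-specific transform are not modeled. It only shows how 16-bit words can be split into per-bit planes, compressed plane by plane, and partially read back.

```python
import random
import struct
import zlib


def to_bf16(x: float) -> int:
    """Return the BF16 bit pattern of x by truncating float32 to its top 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def to_planes(words: list[int], width: int = 16) -> list[bytes]:
    """Split width-bit words into bit-planes, most significant plane first."""
    assert len(words) % 8 == 0, "sketch assumes a multiple of 8 words"
    planes = []
    for b in range(width - 1, -1, -1):
        plane = bytearray(len(words) // 8)
        for i, w in enumerate(words):
            if (w >> b) & 1:
                plane[i // 8] |= 1 << (i % 8)
        planes.append(bytes(plane))
    return planes


def from_planes(planes: list[bytes], n: int, width: int = 16) -> list[int]:
    """Rebuild n words from the leading planes; planes not supplied read as zero."""
    words = [0] * n
    for k, plane in enumerate(planes):
        b = width - 1 - k
        for i in range(n):
            if (plane[i // 8] >> (i % 8)) & 1:
                words[i] |= 1 << b
    return words


# Synthetic stand-in for a weight block (assumption: small zero-mean values).
rng = random.Random(0)
words = [to_bf16(rng.gauss(0.0, 0.02)) for _ in range(4096)]
planes = to_planes(words)

# Compare a word-major stream against per-plane streams under the same codec.
word_major = struct.pack(f"<{len(words)}H", *words)
word_major_size = len(zlib.compress(word_major, 9))
plane_major_size = sum(len(zlib.compress(p, 9)) for p in planes)
print(f"word-major: {word_major_size} B, plane-major: {plane_major_size} B")

# Full fetch is lossless; a 12-plane fetch keeps sign, exponent, and 3 mantissa bits.
assert from_planes(planes, len(words)) == words
reduced = from_planes(planes[:12], len(words))
```

Reading only the leading 12 planes touches 12/16 of the uncompressed plane bytes in this sketch, which is the sense in which fetch cost scales with the requested precision. The sizes printed depend on the synthetic data and on zlib, so they should not be compared with the ratios reported in the abstract.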