The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPU-based systems and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads feed-forward network (FFN) computation into the flash die while executing attention on lightweight CMOS logic backed by external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code (ECC) units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, which operate directly on raw NAND reads with integrated ECC. Attention weights remain in DRAM, and a KV-cache-aware scheduler sustains throughput as the context length grows. Evaluated on OPT and LLaMA models with up to 30B parameters, NVLLM achieves a 16.7$\times$--37.9$\times$ speedup over A800-based out-of-core inference and up to a 4.7$\times$ speedup over SSD-like designs, with only 2.7\% CMOS area overhead.
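To make the dot-product decomposition concrete, the following is a minimal software sketch, not the NVLLM hardware design. It splits a GEMV into independent page-sized dot products and accumulates them in an arbitrary completion order, which is one plausible reading of how out-of-order lanes could consume page-level weight reads. The function name `gemv_by_dot_products`, the parameter `page_elems` (a stand-in for the number of weights held in one NAND page), and the shuffled task order are all assumptions made for illustration.

```python
import random


def gemv_by_dot_products(weights, x, page_elems=4, seed=0):
    """Compute y = W @ x as independent page-sized dot products.

    Assumption: page_elems is a hypothetical count of weights per NAND page.
    """
    rows, cols = len(weights), len(x)
    # One work item per (row, page-sized column chunk).
    tasks = [(r, c) for r in range(rows) for c in range(0, cols, page_elems)]
    # Shuffle to mimic lanes finishing in an arbitrary order (assumption).
    random.Random(seed).shuffle(tasks)
    y = [0.0] * rows
    for r, c in tasks:
        chunk_w = weights[r][c:c + page_elems]
        chunk_x = x[c:c + page_elems]
        # Each partial dot product is accumulated into its output row.
        y[r] += sum(w * v for w, v in zip(chunk_w, chunk_x))
    return y
```

Because every partial result is simply added into its output row, the order in which chunks complete does not change the result for exactly representable values such as small integers; with general floating-point weights, different orders may differ by rounding error.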