The growing demand for deploying Small Language Models (SLMs) on edge devices, including laptops, smartphones, and embedded platforms, has exposed fundamental inefficiencies in existing accelerators. While GPUs handle prefill workloads efficiently, the autoregressive decoding phase is dominated by general matrix-vector multiplication (GEMV) operations that are inherently memory-bound, resulting in poor utilization and prohibitive energy costs at the edge. In this work, we present EdgeCIM, a hardware-software co-design framework that rethinks accelerator design for end-to-end decoder-only inference. At its core is a compute-in-memory (CIM) macro, implemented in a 65 nm process, coupled with a tile-based mapping strategy that balances pipeline stages, maximizing parallelism while alleviating DRAM bandwidth bottlenecks. Our simulator enables design space exploration for SLMs of up to 4B parameters, identifying Pareto-optimal configurations in terms of latency and energy. Compared to an NVIDIA Orin Nano, EdgeCIM achieves up to 7.3x higher throughput and 49.59x better energy efficiency on LLaMA3.2-1B, and delivers 9.95x higher throughput than the Qualcomm SA8255P on LLaMA3.2-3B. Extensive benchmarks on TinyLLaMA-1.1B, LLaMA3.2 (1B, 3B), Phi-3.5-mini-3.8B, Qwen2.5 (0.5B, 1.5B, 3B), SmolLM2-1.7B, SmolLM3-3B, and Qwen3 (0.6B, 1.7B, 4B) show that our accelerator, under INT4 precision, achieves on average 336.42 tokens/s and 173.02 tokens/J. These results establish EdgeCIM as a compelling solution for real-time, energy-efficient edge-scale SLM inference.
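To illustrate why decoding is memory-bound while prefill is not, the following minimal sketch compares the arithmetic intensity (operations per byte of weight traffic) of a single weight matrix applied to one token versus a batch of prompt tokens. The layer size, the prompt length, and the simplification of counting only weight bytes (ignoring activations and the KV cache) are assumptions made for illustration; they are not taken from the EdgeCIM design or its simulator.

```python
def arithmetic_intensity(rows, cols, tokens, bytes_per_weight):
    """Operations per byte of weight traffic for a (tokens x cols) @ (cols x rows) product.

    Simplifying assumption: only weight bytes are counted; activation and
    KV-cache traffic are ignored.
    """
    ops = 2 * rows * cols * tokens            # one multiply and one add per weight per token
    weight_bytes = rows * cols * bytes_per_weight
    return ops / weight_bytes


# Hypothetical layer: a 2048 x 2048 projection stored in INT4 (0.5 bytes per weight).
HIDDEN = 2048
INT4_BYTES = 0.5

# Decoding processes one token at a time, so the product is a GEMV.
decode_intensity = arithmetic_intensity(HIDDEN, HIDDEN, tokens=1, bytes_per_weight=INT4_BYTES)

# Prefill processes the whole prompt at once (here an assumed 512 tokens), so it is a GEMM.
prefill_intensity = arithmetic_intensity(HIDDEN, HIDDEN, tokens=512, bytes_per_weight=INT4_BYTES)

print(f"decode  (GEMV): {decode_intensity:.1f} ops/byte")
print(f"prefill (GEMM): {prefill_intensity:.1f} ops/byte")
```

Under these assumptions each weight fetched during decoding is used for only one multiply-accumulate, so throughput is limited by how fast weights can be read. Prefill reuses every weight across all prompt tokens.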