Deploying Large Language Models (LLMs) on resource-constrained edge devices is limited by memory bandwidth and power consumption. While ternary quantization (e.g., BitNet b1.58) significantly reduces model size, its direct deployment on general-purpose hardware is hindered by workload imbalance, bandwidth-bound decoding, and strict data dependencies. To address these challenges, we propose \textbf{VitaLLM}, a hardware-software co-designed accelerator tailored for efficient ternary LLM inference. We introduce a heterogeneous \textbf{Dual-Core Compute Strategy} that pairs specialized TINT-Cores for large ternary projections with a unified BoothFlex-Core for mixed-precision attention, sustaining high utilization across both the compute-bound prefill stage and the bandwidth-bound decode stage. Furthermore, we develop a \textbf{Leading One Prediction (LOP)} mechanism to prune redundant Key-Value (KV) cache fetches and a \textbf{Dependency-Aware Scheduling} framework to hide the latency of nonlinear operations. Implemented in TSMC 16 nm technology, VitaLLM achieves a decoding throughput of 70.70 tokens/s within an area of 0.223 mm$^2$ and a power consumption of 65.97 mW. The design delivers a Figure of Merit (FOM) of 17.4 TOPS/mm$^2$/W, significantly outperforming state-of-the-art accelerators. Finally, we explore an extended bit-serial design (BoothFlex-BS) to demonstrate the architecture's adaptability for precision-agile inference.
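For readers unfamiliar with ternary quantization, the following is a minimal sketch of why ternary projections are cheap to compute. It is not the VitaLLM datapath. It assumes the absmean scheme used by BitNet b1.58 (scale by the mean absolute weight, round, clip to $\{-1, 0, +1\}$), and the function names `absmean_ternarize` and `ternary_dot` are hypothetical, chosen only for illustration.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Quantize floats to {-1, 0, +1} with one shared scale (absmean scheme, assumed)."""
    # Scale is the mean absolute value of the weights; eps avoids division by zero.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def ternary_dot(ternary, activations):
    """Dot product with ternary weights: only additions and subtractions are needed."""
    acc = 0
    for t, a in zip(ternary, activations):
        if t == 1:
            acc += a      # weight +1: add the activation
        elif t == -1:
            acc -= a      # weight -1: subtract the activation
        # weight 0: skip entirely (no work, no operand fetch required)
    return acc


# Illustrative values only; they are not taken from the paper.
weights = [0.9, -0.05, -1.2, 0.4]
activations = [3, 7, -2, 5]          # e.g. integer-quantized activations
ternary, scale = absmean_ternarize(weights)
result = ternary_dot(ternary, activations) * scale  # rescale once per output
```

Because every weight is one of three values, the inner loop needs no multiplier, and zero weights contribute no work. This is the general property that a dedicated ternary compute core can exploit; how VitaLLM's TINT-Cores do so is described in the paper itself, not in this sketch.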