We present VitaLLM, a mixed-precision accelerator that enables ternary-weight large language models to run efficiently on edge devices. The design combines two compute cores: a multiplier-free TINT core for ternary-INT projections, and a BoothFlex core that reuses a radix-4 Booth datapath for both INT8$\times$INT8 attention and ternary-INT operations, sustaining utilization without duplicating arrays. A predictive sparse attention mechanism employs a leading-one (LO) surrogate with a comparison-free top-$K$ selector to prune key/value (KV) fetches by roughly $1-K/M$ for $M$ cached tokens, confining exact attention to $K$ candidates. System-level integration uses head-level pipelining and an absmax-based quantization barrier to standardize cross-core interfaces and overlap nonlinear reductions with linear tiles. A 16 nm silicon prototype at 1 GHz/0.8 V achieves 72.46 tokens/s in decode and 0.88 s prefill (64 tokens) within 0.214 mm$^2$ and 120 KB of on-chip memory, and ablations show reduced KV traffic and improved utilization. These results demonstrate practical BitNet b1.58 (3B) inference on edge-class platforms and provide a compact blueprint for future mixed-precision LLM accelerators.
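The three arithmetic ideas named above (absmax quantization, multiplier-free ternary projection, and LO-predicted sparse attention) can be illustrated with a minimal software sketch. It is not the hardware design: every function name is hypothetical, the LO surrogate is assumed to replace each magnitude by the power of two at its leading one bit, and `heapq.nlargest` stands in for the comparison-free top-$K$ selector, whose circuit the abstract does not describe.

```python
import heapq
import math


def absmax_quantize(xs, bits=8):
    """Scale a vector so its largest magnitude maps to the signed INT limit."""
    qmax = 2 ** (bits - 1) - 1               # 127 for INT8
    amax = max(abs(x) for x in xs) or 1.0    # guard against an all-zero vector
    scale = qmax / amax
    q = [max(-qmax, min(qmax, round(x * scale))) for x in xs]
    return q, scale


def ternary_dot(weights, acts):
    """Dot product with weights in {-1, 0, +1}: additions and subtractions only."""
    acc = 0
    for w, a in zip(weights, acts):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
    return acc


def lo_score(q, k):
    """Assumed LO surrogate for q.k on integer vectors.

    Each |value| is approximated by 2**(index of its leading one bit),
    so every product reduces to a shift by the sum of two bit indices.
    """
    score = 0
    for a, b in zip(q, k):
        if a == 0 or b == 0:
            continue
        shift = (abs(a).bit_length() - 1) + (abs(b).bit_length() - 1)
        term = 1 << shift
        score += term if (a < 0) == (b < 0) else -term
    return score


def sparse_attention(q, keys, values, k_top):
    """Predict K candidates with the surrogate, then attend exactly to them."""
    scores = [lo_score(q, k) for k in keys]
    # Software stand-in for the top-K selector (assumption, not the paper's circuit).
    cand = heapq.nlargest(k_top, range(len(keys)), key=scores.__getitem__)
    # Exact scaled dot-product attention restricted to the K candidates.
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    logits = [sum(a * b for a, b in zip(q, keys[i])) * inv_sqrt_d for i in cand]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    dim = len(values[0])
    out = [sum(e / total * values[i][d] for e, i in zip(exps, cand))
           for d in range(dim)]
    fetch_reduction = 1 - k_top / len(keys)  # fraction of KV fetches skipped
    return out, sorted(cand), fetch_reduction
```

With $M = 4$ cached tokens and $K = 2$, only two key/value rows are fetched for the exact pass, so the sketch reports a fetch reduction of $1 - K/M = 0.5$. The surrogate only has to rank candidates, which is why a coarse power-of-two approximation can suffice for the prediction step.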