Ternary weight quantization (e.g., BitNet b1.58) offers a promising path to mitigate the memory bandwidth bottleneck in Large Language Model (LLM) inference. However, conventional compute platforms lack native support for ternary-weight arithmetic, often relying on inefficient dequantization. Lookup table (LUT)-based hardware architectures provide an effective alternative by replacing multiplications with conditional additions, but their design space remains largely unexplored. Existing designs rely on heuristic parameter selection, lacking a systematic understanding of the architectural trade-offs. This work addresses this gap by formalizing the design space of ternary LUT-based accelerators and presenting an open-source hardware generator coupled with an analytical cost model, validated against synthesis in TSMC 16nm technology. By spanning the full architectural space, this framework not only enables rapid design space exploration but also establishes a common footing for fair cross-design evaluation, which was previously hindered by inconsistent instantiations across published accelerators. Using this framework, we challenge several assumptions and design choices in recent literature. We demonstrate that the optimal architecture is fundamentally governed by the activation data type: while LUT-based reuse offers significant gains for high-cost arithmetic (e.g., FP16), it yields diminishing returns for small integer types. Furthermore, we show that maximizing core size consistently improves area density compared to highly tiled approaches. Our optimized designs achieve a 2.2x area reduction compared to multiplier-based baselines. Moreover, by benchmarking state-of-the-art implementations against our model, we reveal that correcting suboptimal parameters yields up to a 1.2x area improvement.
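As a rough illustration of the LUT-based idea summarized above, the following Python sketch precomputes, for each small group of activations, the partial sum for every ternary weight pattern, then reuses those tables across all weight rows so that no multiplications are needed. It is a minimal software model under stated assumptions, not the paper's hardware generator or cost model: the function names `build_lut` and `lut_matvec` and the `group` parameter are hypothetical and chosen only for this example.

```python
from itertools import product


def build_lut(acts):
    """Precompute the partial sum for every ternary weight pattern over one activation group."""
    # Keys are tuples of weights in {-1, 0, +1}; each entry uses only add/subtract/skip.
    return {
        pattern: sum(a if w == 1 else -a if w == -1 else 0 for w, a in zip(pattern, acts))
        for pattern in product((-1, 0, 1), repeat=len(acts))
    }


def lut_matvec(weights, acts, group=2):
    """Compute weights @ acts using one table lookup per (row, activation group)."""
    assert len(acts) % group == 0, "activation length must be a multiple of the group size"
    # One table per activation group (3**group entries), shared by every weight row.
    luts = [build_lut(acts[i:i + group]) for i in range(0, len(acts), group)]
    out = []
    for row in weights:
        total = 0
        for g, lut in enumerate(luts):
            total += lut[tuple(row[g * group:(g + 1) * group])]
        out.append(total)
    return out


# Illustrative values only (not taken from the paper).
weights = [[1, 0, -1, 1], [-1, -1, 0, 1], [0, 0, 0, -1]]
acts = [3, 5, 2, 7]
result = lut_matvec(weights, acts)
reference = [sum(w * a for w, a in zip(row, acts)) for row in weights]
print(result, reference)
```

In this toy model, the cost of building each table is amortized over the number of weight rows that read it, which gives some intuition for why table reuse matters more when each addition is expensive (e.g., FP16) than when it is cheap (small integers).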