Lookup tables (LUTs) have recently gained attention as an alternative compute mechanism that maps input operands to precomputed results, eliminating the need for arithmetic logic. LUTs not only reduce logic complexity, but also naturally support diverse numerical precisions without requiring separate circuits for each bitwidth, an increasingly important feature for quantized DNNs. This creates a favorable tradeoff in PIM: memory capacity can be used in place of logic to increase computational throughput, aligning well with DRAM-PIM architectures that offer high bandwidth and abundant memory but limited logic density. In this work, we explore this capacity-computation tradeoff in LUT-based PIM designs, where memory capacity is traded for performance by packing multiple MAC operations into a single LUT lookup. Building on this insight, we propose LOCALUT, a PIM-based design for efficient low-bit quantized DNN inference using operation-packed LUTs. First, we observe that these LUTs contain extensive redundancy and introduce LUT canonicalization, which eliminates duplicate entries to reduce LUT size. Second, we propose the reordering LUT, a lightweight auxiliary LUT that remaps weight vectors to the canonical form required by LUT canonicalization with a single lookup. Third, we propose LUT slice streaming, a novel execution strategy that exploits the DRAM-buffer hierarchy by streaming only the relevant LUT columns into the buffer and reusing them across multiple weight vectors. Evaluated on a real system based on UPMEM devices, we demonstrate a geometric mean speedup of 1.82x across various numeric precisions and DNN models. We believe LOCALUT opens a path toward scalable, low-logic PIM designs tailored for LUT-based DNN inference. Our implementation of LOCALUT is available at https://github.com/AIS-SNU/LoCaLUT.
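The core ideas above can be illustrated with a minimal sketch (not the LOCALUT implementation; values, bitwidths, and the sorted-order canonical form are illustrative assumptions): a packed LUT maps a (weight vector, activation vector) pair to a precomputed dot product, canonicalization stores only one row per weight multiset, and an auxiliary reordering LUT remaps each weight vector to its canonical row, since a dot product is unchanged when weights and activations are permuted together.

```python
from itertools import product

W_VALS = (-1, 0, 1)      # ternary weights (illustrative low-bit encoding)
A_VALS = (0, 1, 2, 3)    # 2-bit unsigned activations
K = 2                    # MAC operations packed into one lookup

# Naive packed LUT: every (weight vec, activation vec) -> dot product.
full_lut = {
    (w, a): sum(wi * ai for wi, ai in zip(w, a))
    for w in product(W_VALS, repeat=K)
    for a in product(A_VALS, repeat=K)
}

# Canonicalization: use the sorted weight vector as the canonical form.
# The reordering table records, per weight vector, its canonical row
# and the permutation that must also be applied to the activations.
reorder = {}
for w in product(W_VALS, repeat=K):
    perm = tuple(sorted(range(K), key=lambda i: w[i]))
    reorder[w] = (tuple(w[i] for i in perm), perm)

# Canonical LUT: one row per distinct canonical weight vector.
canon_lut = {
    (c, a): sum(ci * ai for ci, ai in zip(c, a))
    for c in {cw for cw, _ in reorder.values()}
    for a in product(A_VALS, repeat=K)
}

def lookup(w, a):
    c, perm = reorder[w]                 # cheap auxiliary lookup
    a_perm = tuple(a[i] for i in perm)   # remap activations to match
    return canon_lut[(c, a_perm)]

# Every packed MAC result matches the naive table,
# while fewer distinct weight rows are stored (9 -> 6 here).
assert all(lookup(w, a) == v for (w, a), v in full_lut.items())
print(len(W_VALS) ** K, "->", len({cw for cw, _ in reorder.values()}))
```

Even at this toy scale the canonical table needs only 6 of the 9 weight rows; the savings grow with the number of packed operations, which is the redundancy LUT canonicalization exploits.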