PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies

On-device deployments of large language models (LLMs) are rapidly proliferating across mobile and edge platforms. LLM inference comprises a compute-intensive prefill phase and a memory-bandwidth-intensive decode phase, and the decode phase is widely recognized in both academia and industry as well suited to processing-in-memory (PIM). However, practical PIM-enabled systems face two inconsistencies between these phases: a memory attribute inconsistency, in which prefill favors placing weights in a cacheable region for reuse whereas decode requires weights in a non-cacheable region to reliably trigger PIM, and a weight layout inconsistency between host-friendly and PIM-aware layouts. To address these problems, we introduce \textit{PIM-SHERPA}, a software-only method for efficient on-device LLM inference that resolves both inconsistencies. PIM-SHERPA provides two approaches: DRAM double buffering (DDB), which keeps a single copy of the PIM-aware weights in the non-cacheable region while prefetching the swizzled weights of the next layer into small cacheable buffers, and online weight rearrangement with swizzled memory copy (OWR), which performs an on-demand swizzled memory copy immediately before GEMM. Compared to a baseline PIM emulation system, PIM-SHERPA achieves approximately 47.8--49.7\% memory capacity savings while maintaining performance comparable to the theoretical maximum on the Llama 3.2 model. To the best of our knowledge, this is the first work to identify the memory attribute inconsistency and propose effective solutions on product-level PIM-enabled systems.
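The two approaches can be illustrated with a minimal Python sketch. It is not the authors' implementation: the block-major "PIM-aware" layout, the tile size, and every function name (`to_pim_layout`, `swizzled_copy`, `prefill_ddb`, `prefill_owr`) are assumptions made for illustration, since the actual layout is hardware-specific. Cacheable and non-cacheable memory attributes cannot be modelled in Python, so ordinary lists stand in for both regions, and the DDB prefetch runs sequentially rather than overlapping with the GEMM.

```python
# Hypothetical PIM-aware layout: weights stored as contiguous TILE x TILE
# blocks (block-major). The real layout depends on the PIM hardware.
TILE = 2

def to_pim_layout(w, tile=TILE):
    """Flatten a row-major matrix (list of rows) into block-major order."""
    rows, cols = len(w), len(w[0])
    flat = []
    for br in range(0, rows, tile):
        for bc in range(0, cols, tile):
            for r in range(br, br + tile):
                flat.extend(w[r][bc:bc + tile])
    return flat

def swizzled_copy(flat, rows, cols, dst, tile=TILE):
    """Copy block-major weights into dst, restoring the row-major layout."""
    i = 0
    for br in range(0, rows, tile):
        for bc in range(0, cols, tile):
            for r in range(br, br + tile):
                dst[r][bc:bc + tile] = flat[i:i + tile]
                i += tile
    return dst

def gemm(x, w):
    """Plain matrix product x @ w on lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]

def prefill_ddb(x, pim_layers, shape):
    """DDB-style prefill: two small buffers alternate between layers."""
    rows, cols = shape
    bufs = [[[0] * cols for _ in range(rows)] for _ in range(2)]
    swizzled_copy(pim_layers[0], rows, cols, bufs[0])
    for i in range(len(pim_layers)):
        if i + 1 < len(pim_layers):
            # Prefetch the next layer into the idle buffer. A real system
            # would overlap this copy with the GEMM below.
            swizzled_copy(pim_layers[i + 1], rows, cols, bufs[(i + 1) % 2])
        x = gemm(x, bufs[i % 2])
    return x

def prefill_owr(x, pim_layers, shape):
    """OWR-style prefill: one buffer, rearranged on demand before each GEMM."""
    rows, cols = shape
    buf = [[0] * cols for _ in range(rows)]
    for flat in pim_layers:
        swizzled_copy(flat, rows, cols, buf)
        x = gemm(x, buf)
    return x
```

In both functions only the block-major copy of each layer persists; the host-friendly form exists just in the temporary buffers, which is the source of the capacity savings in this simplified picture. The sketch assumes square layers of identical shape so that consecutive GEMMs chain without reshaping.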
