PUMA: Efficient and Low-Cost Memory Allocation and Alignment Support for Processing-Using-Memory Architectures

Processing-using-DRAM (PUD) architectures impose a restrictive data layout and alignment on their operands: source and destination operands must (i) reside in the same DRAM subarray (i.e., a group of DRAM rows sharing the same row buffer and row decoder) and (ii) be aligned to the boundaries of a DRAM row. However, standard memory allocation routines (such as malloc, posix_memalign, and huge-page-based memory allocation) fail to meet the data layout and alignment requirements that PUD architectures need to operate successfully. To allow the memory allocation API to influence the OS memory allocator and ensure that memory objects are placed within specific DRAM subarrays, we propose PUMA, a new lazy data allocation routine (in the kernel) for PUD memory objects. The key idea of PUMA is to use the internal DRAM mapping information together with huge pages and then split huge pages into finer-grained allocation units that are (i) aligned to the page address and size and (ii) virtually contiguous. We implement PUMA as a kernel module and use QEMU to emulate a RISC-V machine running Fedora 33 with Linux kernel v5.9.0. We emulate a PUD system capable of executing row copy operations (as in RowClone) and Boolean AND/OR/NOT operations (as in Ambit). In our experiments, an operation that cannot be executed in our PUD substrate (due to data misalignment) is performed on the host CPU instead. PUMA significantly outperforms the baseline memory allocators for all evaluated microbenchmarks and allocation sizes.
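The two operand constraints and the CPU fallback described above can be illustrated with a minimal Python sketch. It is not PUMA's implementation: the names `DramGeometry`, `can_run_in_dram`, and `choose_executor` are hypothetical, the row size and rows-per-subarray values are assumed, and the address-to-subarray mapping is assumed to be a simple linear one (real DRAM mappings interleave addresses across channels, banks, and subarrays).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DramGeometry:
    """Assumed DRAM parameters; real devices and address mappings differ."""
    row_bytes: int = 8192          # assumed DRAM row size
    rows_per_subarray: int = 512   # assumed number of rows per subarray

    @property
    def subarray_bytes(self) -> int:
        return self.row_bytes * self.rows_per_subarray


def subarray_of(addr: int, geo: DramGeometry) -> int:
    # Simplifying assumption: physical addresses map linearly onto subarrays.
    return addr // geo.subarray_bytes


def can_run_in_dram(src_addrs, dst_addr, size, geo: DramGeometry) -> bool:
    """Check the two constraints from the text for one bulk operation."""
    operands = [*src_addrs, dst_addr]
    # (ii) every operand starts on a row boundary and spans whole rows
    if size <= 0 or size % geo.row_bytes != 0:
        return False
    if any(addr % geo.row_bytes != 0 for addr in operands):
        return False
    # (i) for each row-sized chunk, all operands fall in the same subarray
    for offset in range(0, size, geo.row_bytes):
        if len({subarray_of(addr + offset, geo) for addr in operands}) != 1:
            return False
    return True


def choose_executor(src_addrs, dst_addr, size, geo: DramGeometry) -> str:
    """Return "PUD" when the operands qualify, otherwise fall back to "CPU"."""
    return "PUD" if can_run_in_dram(src_addrs, dst_addr, size, geo) else "CPU"


geo = DramGeometry()
aligned = choose_executor([0, 8192], 16384, 8192, geo)
misaligned = choose_executor([0, 100], 16384, 8192, geo)
print(aligned, misaligned)
```

Under these assumptions, an allocator that hands out row-aligned units from a single subarray keeps every operation on the in-DRAM path, whereas a single misaligned operand forces the whole operation back to the host CPU.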
