Parallel R-tree-based Spatial Query Processing on a Commercial Processing-in-Memory System

The growing volume of data in scientific domains has made spatial query processing increasingly challenging due to high data transfer costs across the memory hierarchy and limited memory bandwidth. To address these bottlenecks and reduce the energy spent on data movement, this work explores Processing-in-Memory (PIM) systems, executing range queries directly inside memory chips. Unlike prior PIM studies centered on linear scans or hash-based queries, this work is the first to map R-tree range queries onto commercial PIM hardware. The proposed broadcast-based method constructs the R-tree bottom-up on the CPU, broadcasts the top levels to the UPMEM DPUs (DRAM Processing Units) for global filtering, and distributes the lower levels across the DPUs so that batched queries run in parallel in a CPU-DPU system. We evaluate our approach on two real spatial datasets, Sports (999K rectangles) and Lakes (8.4M rectangles), and assess scalability using a synthetic dataset with up to 16M rectangles and 3.9M queries on a commercial UPMEM PIM system with up to 2,540 DPUs. Across all datasets, broadcast-based execution consistently outperforms subtree partitioning because it keeps communication from dominating execution time. On the Lakes dataset, strong scaling from 512 to 2,540 DPUs reduces kernel time from 64.9 s to 17.6 s, yielding up to 3.66x kernel and 2.70x end-to-end speedup relative to the CPU R-tree search on the same system. The PIM kernel also consumes approximately 3.4x less energy than the corresponding CPU search (e.g., 59.6 kJ vs. 167.0 kJ on Lakes), demonstrating scalable and energy-efficient hierarchical spatial range queries.
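The broadcast-based scheme described above can be illustrated with a small, CPU-only simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it does not use the UPMEM SDK, an ordinary loop stands in for the DPUs, the packing heuristic (sorting by x-center) is chosen only for brevity, and all names (`build_index`, `range_query_batch`, `fanout`, `num_dpus`) are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def intersects(a: Rect, b: Rect) -> bool:
    """Closed-interval overlap test between two rectangles."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def mbr(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def build_index(rects: List[Rect], fanout: int, num_dpus: int):
    """Bottom-up build: pack rectangles into leaves, then leaves into per-DPU groups.

    Assumption: rectangles are ordered by x-center before packing; the paper's
    actual packing strategy is not stated in the abstract.
    """
    order = sorted(range(len(rects)), key=lambda i: rects[i][0] + rects[i][2])
    leaves = [order[i:i + fanout] for i in range(0, len(order), fanout)]
    leaf_mbrs = [mbr([rects[i] for i in leaf]) for leaf in leaves]
    per_dpu = -(-len(leaves) // num_dpus)  # ceiling division
    dpus = [list(range(s, min(s + per_dpu, len(leaves))))
            for s in range(0, len(leaves), per_dpu)]
    # Top level: one MBR per simulated DPU; this is the part that would be broadcast.
    top = [mbr([leaf_mbrs[j] for j in ids]) for ids in dpus]
    return leaves, leaf_mbrs, dpus, top

def range_query_batch(rects: List[Rect], index, queries: List[Rect]) -> List[List[int]]:
    """Answer a batch of range queries; each outer iteration simulates one DPU."""
    leaves, leaf_mbrs, dpus, top = index
    results: List[List[int]] = [[] for _ in queries]
    for d, leaf_ids in enumerate(dpus):
        for q_idx, q in enumerate(queries):
            if not intersects(top[d], q):  # global filter on the broadcast top level
                continue
            for j in leaf_ids:  # local search in the lower levels held by this DPU
                if intersects(leaf_mbrs[j], q):
                    results[q_idx].extend(
                        i for i in leaves[j] if intersects(rects[i], q))
    return [sorted(r) for r in results]

# Usage: an 8x8 grid of small rectangles, indexed as x * 8 + y.
rects = [(float(x), float(y), x + 0.8, y + 0.8) for x in range(8) for y in range(8)]
index = build_index(rects, fanout=4, num_dpus=4)
queries = [(0.0, 0.0, 1.5, 1.5), (6.9, 6.9, 7.0, 7.0), (100.0, 100.0, 101.0, 101.0)]
hits = range_query_batch(rects, index, queries)
```

In this sketch every simulated DPU sees the whole query batch and discards queries that miss its top-level bounding rectangle, which mirrors the idea of replicating a small filter everywhere instead of routing queries between partitions. On real hardware the trade-offs (transfer sizes, DPU memory limits, load balance) would need to be measured rather than assumed.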
