Algorithms based on spatial tree traversal are widely regarded as among the most efficient and flexible approaches for many problems in CPU-based high-performance computing (HPC). However, directly transferring these algorithms to GPU architectures often yields substantially smaller performance gains than the high computational throughput of modern GPUs would suggest. The branching nature of tree algorithms leads to thread divergence and irregular memory access patterns -- both of which may severely limit GPU performance. To address these challenges, we propose a Morton (z-order) 'plane-based tree hierarchy' that is specifically designed for GPU architectures. The resulting flattened data layout enables efficient dual-tree traversal with collaborative execution across thread groups, leading to highly coalesced memory access patterns. Based on this framework, we present implementations of two important spatial algorithms -- exact $k$-nearest neighbour search and friends-of-friends (FoF) clustering. In both cases, we observe more than an order-of-magnitude performance improvement over the closest competing GPU libraries for large problem sizes ($N \gtrsim 10^7$), together with strong scaling to distributed multi-GPU systems. We provide an open-source implementation, 'JZ-Tree' (JAX z-order tree), which serves as a foundation for efficient GPU implementations of a broad class of tree-based algorithms.
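To make the Morton (z-order) idea concrete, the following is a minimal pure-Python sketch of how 3D points can be mapped to Morton codes and sorted into a flat, spatially coherent order. It is an illustration only and not the JZ-Tree implementation: the function names (`spread_bits`, `morton3d`, `morton_order`), the 10-bit-per-axis resolution, and the assumption that coordinates lie in the unit cube $[0, 1)^3$ are all choices made for this example.

```python
def spread_bits(v: int, bits: int = 10) -> int:
    """Move bit i of v to bit 3*i, leaving two zero bits between inputs."""
    out = 0
    for i in range(bits):
        out |= ((v >> i) & 1) << (3 * i)
    return out


def morton3d(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the bits of three integer cell indices (x lowest)."""
    return (
        spread_bits(ix, bits)
        | (spread_bits(iy, bits) << 1)
        | (spread_bits(iz, bits) << 2)
    )


def quantize(c: float, bits: int = 10) -> int:
    """Map a coordinate in [0, 1) to an integer cell index (assumed domain)."""
    n = 1 << bits
    return min(max(int(c * n), 0), n - 1)


def morton_order(points, bits: int = 10):
    """Return the indices of `points` sorted by their Morton code."""
    codes = [
        morton3d(quantize(x, bits), quantize(y, bits), quantize(z, bits), bits)
        for (x, y, z) in points
    ]
    return sorted(range(len(points)), key=lambda i: codes[i])


if __name__ == "__main__":
    pts = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1)]
    print(morton_order(pts))
```

After such a sort, points that are close in space tend to be close in memory, and every aligned power-of-eight range of codes corresponds to one octree cell. This is the general property that makes a flattened, pointer-free tree layout possible; how JZ-Tree builds its plane-based hierarchy on top of the ordering is not specified by this sketch.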