This paper presents Cornerstone, an octree construction method that facilitates global domain decomposition and the computation of particle interactions in mesh-free numerical simulations. Our method is based on algorithms developed for 3D computer graphics, which we extend to distributed high-performance computing (HPC) systems. Cornerstone yields both global and locally essential octrees and can operate on all levels of the tree hierarchy in parallel. The resulting octrees support the computation of various kinds of short- and long-range interactions in N-body methods, such as Barnes-Hut and the Fast Multipole Method (FMM). While we provide a CPU implementation, Cornerstone can also run entirely on GPUs. Doing so yields significantly faster tree construction than on CPUs and provides a powerful building block for simulation codes that move beyond an offloading approach, in which only numerically intensive tasks are dispatched to GPUs. With data residing exclusively in GPU memory, Cornerstone eliminates data movement between CPUs and GPUs. As an example, we employ Cornerstone to generate locally essential octrees for a Barnes-Hut treecode running on almost the full LUMI-G system with up to 8 trillion particles.
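The abstract does not spell out how the octree is built. As a rough illustration only, the sketch below follows one approach that is common in computer graphics: map each particle to a Morton (Z-order) key, sort the keys, and subdivide key ranges until no leaf holds more than a chosen number of particles. Whether Cornerstone uses Morton keys, and every name here (`spread_bits`, `morton_key`, `build_leaves`, `BITS`, `bucket_size`), is an assumption made for this example and not taken from the paper. The code is serial Python, not the parallel GPU implementation the abstract describes.

```python
from bisect import bisect_left

BITS = 10  # assumed bits per dimension; keys then occupy 3 * BITS = 30 bits


def spread_bits(v: int) -> int:
    """Insert two zero bits between each of the low 10 bits of v."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v


def morton_key(x: float, y: float, z: float) -> int:
    """Map a point in the unit cube [0, 1)^3 to a 30-bit Morton key."""
    scale = 1 << BITS
    ix, iy, iz = (min(int(c * scale), scale - 1) for c in (x, y, z))
    return (spread_bits(ix) << 2) | (spread_bits(iy) << 1) | spread_bits(iz)


def build_leaves(sorted_keys, bucket_size):
    """Return leaves as (key_start, key_end, particle_count) in key order.

    A node is split into its 8 children while it holds more than
    bucket_size particles and can still be subdivided.
    """
    leaves = []
    stack = [(0, 1 << (3 * BITS))]  # root node covers the whole key space
    while stack:
        start, end = stack.pop()
        # Particles in a node form a contiguous run of the sorted key array.
        count = bisect_left(sorted_keys, end) - bisect_left(sorted_keys, start)
        if count > bucket_size and end - start >= 8:
            step = (end - start) // 8
            # Push children in reverse so they are popped in ascending key order.
            for i in reversed(range(8)):
                stack.append((start + i * step, start + (i + 1) * step))
        else:
            leaves.append((start, end, count))
    return leaves


# Usage: 200 deterministic pseudo-scattered points in the unit cube.
points = [((i * 0.3719) % 1.0, (i * 0.6133) % 1.0, (i * 0.8377) % 1.0)
          for i in range(200)]
keys = sorted(morton_key(*p) for p in points)
leaves = build_leaves(keys, bucket_size=8)
```

Because each leaf is a contiguous key range, the sorted key array can be cut into consecutive pieces with similar particle counts. This is one way a space-filling-curve ordering can serve domain decomposition, again under the assumptions stated above.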