In Monte Carlo neutron transport simulations of nuclear reactor physics on GPUs, particle sorting plays a significant role in computational performance. Traditionally, CPUs and GPUs are separate devices connected by a link with a low data transfer rate and high latency. Emerging computing chips tend to integrate CPUs and GPUs; one example is Apple silicon with unified memory. Such unified-memory chips open the door to new strategies for CPU-GPU collaboration in Monte Carlo neutron transport. Sorting particles on the CPU and transporting them on the GPU is one such strategy, which on traditional devices with separate CPU and GPU has suffered from high CPU-GPU data transfer latency. The finding is that, on the Apple M2 Max chip, sorting on the CPU leads to better performance per watt than sorting on the GPU for the ExaSMR whole-core benchmark problems and the HTR-10 high-temperature gas-cooled reactor fuel pebble problem. The partially sorted order of the particles has been identified as a contributor to the higher performance of CPU sorting relative to GPU sorting. The in-house code, using both CPU and GPU, achieves 7.5 times the power efficiency of OpenMC on CPU for the ExaSMR whole-core benchmark with depleted fuel, and 150 times for the HTR-10 fuel pebble benchmark with depleted fuel.
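The abstract does not say which sorting algorithm the in-house code uses, nor exactly why a partially sorted particle order favors the CPU. As a general illustration only, the minimal Python sketch below shows that an adaptive comparison sort (here Python's built-in sort, standing in for whatever CPU sort is actually used) needs far fewer comparisons on nearly ordered keys than on shuffled ones. The names `sort_order`, `count_comparisons`, and `make_partially_sorted_keys`, and the use of plain integer sort keys, are assumptions made for this example and are not taken from the paper.

```python
import random
from functools import cmp_to_key


def sort_order(keys):
    """Return the stable permutation of particle indices that orders `keys`."""
    return sorted(range(len(keys)), key=keys.__getitem__)


def count_comparisons(keys):
    """Count the comparisons Python's built-in adaptive sort makes on `keys`."""
    counter = 0

    def cmp(a, b):
        nonlocal counter
        counter += 1
        return (a > b) - (a < b)

    sorted(keys, key=cmp_to_key(cmp))
    return counter


def make_partially_sorted_keys(n, swaps, seed=0):
    """Hypothetical stand-in for particle keys: ordered keys with a few swaps."""
    rng = random.Random(seed)
    keys = list(range(n))
    for _ in range(swaps):
        i, j = rng.randrange(n), rng.randrange(n)
        keys[i], keys[j] = keys[j], keys[i]
    return keys


if __name__ == "__main__":
    n = 10_000
    partial = make_partially_sorted_keys(n, swaps=5)
    shuffled = list(range(n))
    random.Random(1).shuffle(shuffled)
    print("comparisons, partially sorted:", count_comparisons(partial))
    print("comparisons, fully shuffled:  ", count_comparisons(shuffled))
```

In a transport code the permutation returned by something like `sort_order` would be used to gather particle data into sorted order before the next GPU kernel launch. Whether the same effect explains the measured advantage on the M2 Max is a question for the paper itself, not for this sketch.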