Enabling Population-Level Parallelism in Tree-Based Genetic Programming for GPU Acceleration

Tree-based Genetic Programming (TGP) is a widely used evolutionary algorithm for tasks such as symbolic regression, classification, and robotic control. Because TGP is computationally intensive, GPU acceleration is crucial for achieving scalable performance. However, efficient GPU-based execution of TGP remains challenging due to three core issues: (1) the structural heterogeneity of program individuals, (2) the complexity of integrating multiple levels of parallelism, and (3) the incompatibility between high-performance CUDA execution and flexible Python-based environments. To address these issues, we propose EvoGP, a high-performance framework that accelerates TGP on GPUs via population-level parallel execution. First, EvoGP introduces a tensorized representation that encodes variable-sized trees into fixed-shape, memory-aligned arrays, enabling uniform memory access and parallel computation across diverse individuals. Second, EvoGP adopts an adaptive parallelism strategy that dynamically combines intra- and inter-individual parallelism based on dataset size, ensuring high GPU utilization across a broad spectrum of tasks. Third, EvoGP embeds custom CUDA kernels into the PyTorch runtime, achieving seamless integration with Python-based environments such as Gym, MuJoCo, Brax, and Genesis. Experimental results demonstrate that EvoGP achieves a peak throughput exceeding $10^{11}$ GPops/s. This corresponds to a speedup of up to $304\times$ over existing GPU-based TGP implementations and $18\times$ over state-of-the-art CPU-based libraries. Furthermore, EvoGP maintains comparable accuracy and exhibits improved scalability across large population sizes. EvoGP is open source and accessible at: https://github.com/EMI-Group/evogp.
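
To make the idea of a tensorized representation concrete, the following is a minimal sketch in plain Python. It is not EvoGP's actual encoding or API: the node-type codes, the `MAX_LEN` capacity, the prefix-order layout, and the `encode` and `evaluate` helpers are all assumptions chosen for illustration. The sketch pads variable-sized trees into equally long rows, so that a whole population forms a fixed-shape array, and evaluates each row with the same stack-based loop.

```python
# Illustrative only: hypothetical node codes and layout, not EvoGP's real format.
CONST, VAR, ADD, MUL, NEG = 0, 1, 2, 3, 4
MAX_LEN = 8  # assumed fixed capacity per tree


def encode(prefix_nodes):
    """Pad a prefix-order list of (node_type, value) pairs to MAX_LEN slots."""
    size = len(prefix_nodes)
    if size > MAX_LEN:
        raise ValueError("tree exceeds the fixed capacity")
    pad = MAX_LEN - size
    types = [t for t, _ in prefix_nodes] + [CONST] * pad
    values = [float(v) for _, v in prefix_nodes] + [0.0] * pad
    return types, values, size


def evaluate(types, values, size, x):
    """Evaluate one encoded tree by scanning its prefix order right to left."""
    stack = []
    for i in range(size - 1, -1, -1):
        t = types[i]
        if t == CONST:
            stack.append(values[i])
        elif t == VAR:
            stack.append(x[int(values[i])])  # value holds the input index
        elif t == NEG:
            stack.append(-stack.pop())
        else:  # binary operators: ADD or MUL
            a, b = stack.pop(), stack.pop()
            stack.append(a + b if t == ADD else a * b)
    return stack.pop()


# Two trees of different sizes: x0 * x1 + 2 and -x0.
trees = [
    [(ADD, 0), (MUL, 0), (VAR, 0), (VAR, 1), (CONST, 2.0)],
    [(NEG, 0), (VAR, 0)],
]
population = [encode(t) for t in trees]

# Every individual now occupies a row of identical length.
pop_types = [types for types, _, _ in population]
pop_values = [values for _, values, _ in population]
pop_sizes = [size for _, _, size in population]

x = [3.0, 4.0]
outputs = [
    evaluate(pop_types[k], pop_values[k], pop_sizes[k], x)
    for k in range(len(population))
]
```

Because every row has the same length and is processed by the same loop, the outer loop over individuals is the part that a GPU implementation could, in principle, distribute across threads. The sequential Python loop here only stands in for that parallel dispatch.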