FlashSpread: IO-Aware GPU Simulation of Non-Markovian Epidemic Dynamics via Kernel Fusion

Non-Markovian (renewal) epidemic simulation on multi-million-node contact networks is essential for realistic forecasting under general age-dependent holding-time distributions (log-normal, Weibull, Erlang, and similar). The age-dependent hazard, however, forces dense per-step updates that render the sparse event-queue strategies of standard CPU methods ineffective. We present FlashSpread, a GPU framework that consolidates the per-step renewal pipeline (CSR traversal, numerically stable erfcx-based hazard evaluation, Bernoulli tau-leaping, state transition, and next-step infectivity write-back) into a single fused Triton kernel whose intermediates never leave streaming-multiprocessor registers. Block-scalar skips preserve CUDA Graph capture, and a degree-aware CSR dispatch (thread / warp / edge-merge) preserves peak throughput on scale-free graphs. On an NVIDIA A100 the fused CUDA-Graph engine reaches 8.09 Giga-NUPS at N = 10^6 on a uniform-degree graph, a 217x strict hardware speedup over optimised CPU tau-leaping at the same N. On a Barabasi-Albert graph of the same size, the merge-based dispatch recovers a 4.5x throughput gain (0.45 to 2.0 Giga-NUPS) over the default kernel. The framework scales to N = 10^8 on a single A100 (40 GB), with a mixed-precision storage path that extends the L2-reachable scale by roughly 3x and delivers a 1.15x throughput lift at the far bandwidth-bound end. Validation against an exact non-Markovian Gillespie reference shows a structural-bias floor of approximately 6% on peak infection and approximately 7% on final attack rate. This floor does not detectably decrease as the tolerance epsilon is reduced toward 0 across two decades, and it lies comfortably within typical epidemiological parameter uncertainty. Code: https://github.com/Shakeri-Lab/FlashSpread.
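
To make the hazard-evaluation and tau-leaping stages concrete, the following is a minimal CPU sketch in plain Python. It is not the FlashSpread kernel. It assumes a log-normal holding time with parameters mu and sigma, for which the hazard can be written as h(a) = sqrt(2/pi) / (a * sigma * erfcx(z / sqrt(2))) with z = (ln a - mu) / sigma, and it assumes the common Bernoulli tau-leaping rule p = 1 - exp(-h(a) * dt). The function names (erfcx, lognormal_hazard, tau_leap_step) and the series cut-offs are hypothetical choices made for illustration only.

```python
import math
import random

SQRT_PI = math.sqrt(math.pi)


def erfcx(x: float) -> float:
    """Scaled complementary error function exp(x^2) * erfc(x).

    Illustrative stand-in (assumption), not the routine used by FlashSpread.
    """
    if x < -26.0:
        return math.inf  # exp(x^2) would overflow a double here
    if x < 10.0:
        return math.exp(x * x) * math.erfc(x)
    # Large x: a truncated asymptotic series avoids the underflow of erfc(x).
    r = 1.0 / (x * x)
    return (1.0 - 0.5 * r + 0.75 * r * r - 1.875 * r ** 3) / (x * SQRT_PI)


def lognormal_hazard(age: float, mu: float, sigma: float) -> float:
    """Hazard pdf(age) / survival(age) of a log-normal holding time.

    Writing the ratio through erfcx avoids the 0/0 that the naive
    pdf / survival quotient produces deep in the upper tail.
    """
    if age <= 0.0:
        return 0.0
    z = (math.log(age) - mu) / sigma
    return math.sqrt(2.0 / math.pi) / (age * sigma * erfcx(z / math.sqrt(2.0)))


def tau_leap_step(ages, dt, mu, sigma, rng):
    """One Bernoulli tau-leaping step over a list of per-node ages.

    Each node fires with probability 1 - exp(-h(age) * dt) (assumed rule).
    Returns one boolean per node.
    """
    fired = []
    for age in ages:
        p = -math.expm1(-lognormal_hazard(age, mu, sigma) * dt)
        fired.append(rng.random() < p)
    return fired


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [0.5, 1.0, 2.0, 400.0]
    print([lognormal_hazard(a, 0.0, 0.1) for a in ages])
    print(tau_leap_step(ages, 0.1, 0.0, 0.1, rng))
```

Every node is visited at every step regardless of whether it fires. This is the dense per-step update pattern that the abstract contrasts with sparse event queues.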
