Non-Markovian (renewal) epidemic simulation on multi-million-node contact networks is essential for realistic forecasting under general age-dependent holding-time distributions (log-normal, Weibull, Erlang, and similar), but the age-dependent hazard forces dense per-step updates that render the sparse event-queue strategies of standard CPU methods ineffective. We present FlashSpread, a GPU framework that consolidates the per-step renewal pipeline (CSR traversal, numerically stable erfcx-based hazard evaluation, Bernoulli tau-leaping, state transition, and next-step infectivity write-back) into a single fused Triton kernel whose intermediates never leave streaming-multiprocessor registers, with block-scalar skips that preserve CUDA Graph capture and a degree-aware CSR dispatch (thread / warp / edge-merge) that preserves peak throughput on scale-free graphs. On an NVIDIA A100, the fused CUDA-Graph engine reaches 8.09 Giga-NUPS at N = 10^6 on a uniform-degree graph, a 217x strict hardware speedup over optimised CPU tau-leaping at the same N. On a Barabási-Albert graph of the same size, the merge-based dispatch recovers 4.5x (0.45 to 2.0 Giga-NUPS) over the default kernel. The framework scales to N = 10^8 on a single A100 (40 GB), with a mixed-precision storage path that extends the L2-reachable scale by roughly 3x and delivers a 1.15x throughput lift at the far bandwidth-bound end. Validation against an exact non-Markovian Gillespie reference shows a structural-bias floor of approximately 6% on peak infection and approximately 7% on final attack rate that does not detectably decrease as the tolerance epsilon approaches 0 across two decades; this floor is comfortably within typical epidemiological parameter uncertainty. Code: https://github.com/Shakeri-Lab/FlashSpread.
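To make the erfcx-based hazard step concrete, here is a minimal CPU sketch in plain Python. It is an illustration under stated assumptions, not the FlashSpread kernel. It assumes a log-normal holding time with parameters `mu` and `sigma`, for which the hazard at age a can be written as h(a) = sqrt(2/pi) / (a * sigma * erfcx(z / sqrt(2))) with z = (ln a - mu) / sigma, so the vanishing survival function is never formed explicitly. It also assumes that the Bernoulli tau-leaping firing probability over a step `dt` takes the common form 1 - exp(-h * dt). The helper names (`erfcx`, `lognormal_hazard`, `fire_probability`, `bernoulli_step`) and the branch thresholds are hypothetical choices made for this example.

```python
import math
import random

SQRT_PI = math.sqrt(math.pi)


def erfcx(x: float) -> float:
    """Scaled complementary error function exp(x^2) * erfc(x), stdlib only.

    Illustrative stand-in for a library erfcx; thresholds are assumptions.
    """
    if x < -26.0:
        # exp(x^2) would overflow a double; the hazard tends to 0 here.
        return math.inf
    if x < 25.0:
        # Both factors are representable in double precision in this range.
        return math.exp(x * x) * math.erfc(x)
    # Large x: erfc(x) underflows, so use the asymptotic series
    # erfcx(x) ~ (1 - t + 3 t^2 - 15 t^3) / (x sqrt(pi)), t = 1 / (2 x^2).
    t = 1.0 / (2.0 * x * x)
    return (1.0 - t + 3.0 * t * t - 15.0 * t ** 3) / (x * SQRT_PI)


def lognormal_hazard(age: float, mu: float, sigma: float) -> float:
    """Hazard f(age) / S(age) of a log-normal holding time, via erfcx."""
    if age <= 0.0:
        return 0.0
    z = (math.log(age) - mu) / sigma
    return math.sqrt(2.0 / math.pi) / (age * sigma * erfcx(z / math.sqrt(2.0)))


def fire_probability(hazard: float, dt: float) -> float:
    """Assumed Bernoulli tau-leap probability 1 - exp(-hazard * dt)."""
    return -math.expm1(-hazard * dt)


def bernoulli_step(ages, mu, sigma, dt, rng):
    """Draw one Bernoulli transition decision per node from its current age."""
    return [
        rng.random() < fire_probability(lognormal_hazard(a, mu, sigma), dt)
        for a in ages
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [0.5, 1.0, 2.0, 5.0]
    print([lognormal_hazard(a, 0.0, 0.5) for a in ages])
    print(bernoulli_step(ages, 0.0, 0.5, 0.1, rng))
```

The naive ratio f(a) / S(a) divides by `0.5 * erfc(z / sqrt(2))`, which underflows to zero in double precision once z is large. The erfcx form cancels the shared exp(-z^2 / 2) factor analytically, so the hazard stays finite at large ages.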