We present a fully device-resident, multi-GPU architecture for the large-scale computational verification of Goldbach's conjecture. In prior work, a segmented double sieve eliminated monolithic VRAM bottlenecks but remained constrained by host-side sieve construction and PCIe transfer latency. In this work, we migrate the entire segment-generation pipeline to the GPU using highly optimised L1 shared-memory tiling, achieving near-zero host-device communication on the critical verification path. To fully exploit heterogeneous multi-GPU clusters, we introduce an asynchronous, lock-free work-stealing pool that replaces static workload partitioning with atomic segment claiming, sustaining $99.7\%$ parallel efficiency on $2$ GPUs and $98.6\%$ on $4$ GPUs. We further implement strict mathematical overflow guards that guarantee the soundness of the 64-bit verification pipeline up to its theoretical ceiling of $1.84 \times 10^{19}$ ($\approx 2^{64}$). On the same hardware, the new architecture achieves a $45.6\times$ algorithmic speedup over its host-coupled predecessor at $N = 10^{10}$. End-to-end, the framework verifies Goldbach's conjecture up to $10^{12}$ in $36.5$ seconds on a single NVIDIA RTX 5090, and up to $10^{13}$ in $133.5$ seconds on a four-GPU system. All code is open source and reproducible on commodity hardware.
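For concreteness, a minimal CUDA sketch of what a device-resident, shared-memory-tiled segment sieve can look like; the tile size, kernel name, and `base_primes` layout are illustrative assumptions, not the paper's actual kernel:

```cuda
// A minimal sketch of device-resident segment sieving with one
// shared-memory tile per thread block. TILE, sieve_tile, and the
// base_primes layout are illustrative assumptions.
#include <cstdint>

constexpr int TILE = 8192;  // tile width in integers (assumption)

__global__ void sieve_tile(uint64_t seg_lo, const uint32_t* base_primes,
                           int n_primes, uint8_t* composite) {
    __shared__ uint8_t tile[TILE];          // tile lives entirely in L1/shared
    uint64_t tile_lo = seg_lo + (uint64_t)blockIdx.x * TILE;

    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        tile[i] = 0;                        // 0 = assumed prime for now
    __syncthreads();

    // Each thread crosses off the multiples of a strided subset of the
    // precomputed base primes (primes up to sqrt of the segment end).
    for (int p = threadIdx.x; p < n_primes; p += blockDim.x) {
        uint64_t q = base_primes[p];
        uint64_t first = ((tile_lo + q - 1) / q) * q;  // first multiple in tile
        if (first < q * q) first = q * q;   // never cross off the prime itself
        for (uint64_t m = first; m < tile_lo + TILE; m += q)
            tile[m - tile_lo] = 1;          // benign write races: all store 1
    }
    __syncthreads();

    // Flush the finished tile to global memory; the host never sees it.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        composite[(uint64_t)blockIdx.x * TILE + i] = tile[i];
}
```

Because each tile is initialised, sieved, and flushed entirely on-device, the host only orchestrates kernel launches; note that positions $0$ and $1$ in the very first tile would need special-casing in a real implementation.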
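The atomic segment claiming that replaces static partitioning can be sketched on the host side as a shared atomic counter; a full work-stealing pool may use per-GPU deques, and the names `next_segment` and `verify_segment_on_gpu` are illustrative stand-ins, not the paper's API:

```cuda
// A minimal host-side sketch of lock-free segment claiming across GPU
// workers (hypothetical names, not the paper's API).
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<uint64_t> next_segment{0};  // shared claim counter, no locks

// Stand-in for launching the device-resident verification kernel on
// the given GPU for one segment (hypothetical placeholder).
void verify_segment_on_gpu(int device_id, uint64_t segment) { /* ... */ }

void gpu_worker(int device_id, uint64_t num_segments) {
    // Each worker atomically claims the next unprocessed segment, so a
    // faster GPU naturally claims more segments than a slower one.
    for (;;) {
        uint64_t seg = next_segment.fetch_add(1, std::memory_order_relaxed);
        if (seg >= num_segments) break;  // pool drained
        verify_segment_on_gpu(device_id, seg);
    }
}

int main() {
    const int num_gpus = 4;
    const uint64_t num_segments = 1 << 20;
    std::vector<std::thread> workers;
    for (int d = 0; d < num_gpus; ++d)
        workers.emplace_back(gpu_worker, d, num_segments);
    for (auto& w : workers) w.join();
}
```

This dynamic claiming is what lets heterogeneous GPUs reach the reported near-linear efficiency: no device idles waiting on a statically assigned range.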
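The overflow guards admit a similarly small sketch. The guard condition below is the standard unsigned-addition check; where exactly such guards sit in the paper's pipeline is an assumption:

```cuda
// A minimal sketch of a 64-bit overflow guard. Soundness requires every
// intermediate value (e.g. segment base + offset) to stay below
// 2^64 - 1, roughly 1.84e19, the pipeline's theoretical ceiling.
#include <cstdint>

// Returns false if a + b would wrap past the 64-bit ceiling, so the
// caller can refuse to verify beyond the sound range.
__host__ __device__ inline bool checked_add_u64(uint64_t a, uint64_t b,
                                                uint64_t* sum) {
    if (a > UINT64_MAX - b) return false;  // would overflow: reject
    *sum = a + b;
    return true;
}
```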