Low-density parity-check (LDPC) decoding is one of the most computationally intensive kernels in the 5G New Radio (NR) physical layer and must complete within a 0.5\,ms transmission time interval while sharing that budget with FFT, channel estimation, demapping, HARQ, and MAC scheduling. Many open and proprietary stacks still execute LDPC decoding on general-purpose CPUs, raising concerns about missed-slot events and limited scalability as bandwidths, modulation orders, and user multiplexing increase. This paper empirically quantifies the benefit of offloading 5G-NR-style LDPC decoding from a Grace CPU to the integrated Blackwell GB10 GPU on an NVIDIA DGX~Spark platform. Using NVIDIA Sionna PHY/SYS on TensorFlow, we construct an NR-like link-level chain with an LDPC5G encoder/decoder, 16-QAM modulation, and AWGN, and sweep both the number of codewords decoded in parallel and the number of belief-propagation iterations, timing only the decoding phase while logging CPU and GPU utilization and power. Across the sweep we observe an average GPU/CPU throughput speedup of approximately $6\times$; per-codeword CPU latency reaches $\approx 0.71$\,ms at 20 iterations (exceeding the 0.5\,ms slot), whereas the GB10 GPU consumes only 6--24\% of the slot budget for the same workloads. Resource-usage measurements show that CPU-based LDPC decoding often occupies around ten Grace cores, whereas GPU-based decoding adds only $\approx$10--15\,W over GPU idle while leaving most CPU capacity available for higher-layer tasks. Because our implementation relies on high-level Sionna layers rather than hand-tuned CUDA, these results represent conservative lower bounds on achievable accelerator performance and provide a reusable, scriptable methodology for evaluating LDPC and other physical-layer kernels on future Grace/Blackwell and Aerial/ACAR/AODT platforms.
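The timing methodology described above (sweep batch size and iteration count, time only the decoding phase, report per-codeword latency as a fraction of the 0.5\,ms slot) can be sketched as follows. This is a minimal illustration, not the paper's benchmark script: `decode_fn` is a hypothetical stand-in for a batched call to Sionna's `LDPC5GDecoder` on pre-generated LLRs, and the dummy at the bottom only mimics iteration-dependent cost.

```python
import time

SLOT_MS = 0.5  # NR slot budget assumed throughout the paper

def time_decode(decode_fn, batch_size, num_iter, reps=5):
    """Time only the decoding phase, averaged over `reps` runs.

    `decode_fn(batch_size, num_iter)` stands in for a call to a
    Sionna LDPC5GDecoder on a pre-generated LLR batch (hypothetical
    wrapper; the real chain also runs the encoder, 16-QAM mapper,
    and AWGN channel outside the timed region).
    """
    decode_fn(batch_size, num_iter)           # warm-up (graph build/JIT)
    t0 = time.perf_counter()
    for _ in range(reps):
        decode_fn(batch_size, num_iter)
    elapsed_ms = (time.perf_counter() - t0) / reps * 1e3
    per_cw_ms = elapsed_ms / batch_size       # per-codeword latency
    return per_cw_ms, per_cw_ms / SLOT_MS     # latency and slot fraction

# Dummy decoder whose cost grows with the iteration count:
dummy = lambda batch, it: time.sleep(1e-4 * it)
lat_ms, slot_frac = time_decode(dummy, batch_size=64, num_iter=20)
```

Sweeping `batch_size` and `num_iter` over a grid and recording `slot_frac` per device reproduces the shape of the experiment; utilization and power are logged separately (e.g. from OS counters and board telemetry) while the timed loop runs.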