Orthogonal time frequency space (OTFS) modulation is more robust in high-mobility channels than conventional orthogonal frequency-division multiplexing (OFDM) waveforms. However, its explicit delay-Doppler (DD) domain representation incurs substantial signal processing complexity, especially as the DD-domain grid size grows. To address this challenge, we present a scalable, real-time Zak-OTFS receiver architecture on GPUs, developed through hardware-algorithm co-design that exploits DD-domain channel sparsity. Our design uses compact matrix operations for key processing stages, a branchless iterative equalizer, and a structured sparse representation of the DD-domain channel matrix to significantly reduce computational and memory overhead. These optimizations enable low-latency processing that consistently meets the real-time processing deadline at the 99.9th percentile. The proposed system achieves up to 906.52 Mbps throughput with a DD grid size of (16384, 32) using 16QAM modulation over 245.76 MHz bandwidth. Extensive evaluations under a Vehicular-A channel model demonstrate strong scalability and robust performance across a CPU platform (Intel Xeon) and multiple GPU platforms (NVIDIA Jetson Orin, RTX 6000 Ada, A100, and H200), highlighting the effectiveness of compute-aware Zak-OTFS receiver design for next-generation (NextG) high-mobility communication systems.
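As a rough sanity check on the quoted configuration, the frame arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not the authors' method: it assumes the grid (16384, 32) is ordered as (delay bins, Doppler bins) and that the grid is critically sampled, so that bandwidth times frame duration equals the number of grid points. Neither assumption is stated in the abstract, and the function and key names are hypothetical.

```python
# Back-of-the-envelope frame arithmetic for the quoted configuration.
# ASSUMPTIONS (not stated in the abstract):
#   * grid order is (delay bins M, Doppler bins N)
#   * the DD grid is critically sampled: bandwidth * frame_duration = M * N

def frame_budget(m_delay_bins, n_doppler_bins, bandwidth_hz, bits_per_symbol):
    """Return basic frame quantities for a critically sampled DD grid."""
    symbols = m_delay_bins * n_doppler_bins          # DD-domain grid points per frame
    frame_duration_s = symbols / bandwidth_hz        # from B * T = M * N
    doppler_period_hz = bandwidth_hz / m_delay_bins  # B = M * Doppler period
    raw_rate_bps = symbols * bits_per_symbol / frame_duration_s
    return {
        "symbols": symbols,
        "frame_duration_s": frame_duration_s,
        "doppler_period_hz": doppler_period_hz,
        "raw_rate_bps": raw_rate_bps,
    }

# Values quoted in the abstract: grid (16384, 32), 16QAM (4 bits/symbol), 245.76 MHz.
budget = frame_budget(16384, 32, 245.76e6, 4)

for name, value in budget.items():
    print(f"{name}: {value:.6g}")
```

Under these assumptions, a frame carries 524288 grid points in about 2.13 ms, and the ceiling with every grid point carrying a 16QAM symbol is about 983.04 Mbps. The reported 906.52 Mbps is roughly 92% of that ceiling; the abstract does not say how the difference arises.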