GNStor: Design of GPU-Native High-Performance Remote All-Flash Array

GPUs have become the leading computing devices for a wide range of data-intensive applications, and they work closely with remote all-flash arrays (AFAs) to accommodate ever-expanding datasets, facilitate multi-client data sharing, and guarantee fault tolerance. Although the GPU is the center of computation, all I/O processing in existing GPU-AFA systems is still CPU-centric: the CPU orchestrates remote I/O requests and runs a centralized AFA engine that handles AFA-level functionalities (e.g., access control and metadata persistence). This mismatch incurs substantial CPU-GPU interaction overhead and I/O traffic amplification, compromising end-to-end I/O performance. In this work, we present \emph{GNStor}, a GPU-native AFA system that enables the GPU to directly access a remote AFA without CPU intervention in the I/O path, thereby fully exploiting the performance of the AFA. Specifically, GNStor first proposes a GPU-centric NVMe over RDMA (NoR) software stack, named \emph{GNoR}, which paves a fast path for GPUs to directly issue NoR I/O requests to the SSDs within a remote AFA. GNoR employs an atomic-operation-based I/O orchestration design and follows the single-instruction-multiple-thread (SIMT) execution model of the GPU, fully exploiting the massive parallelism of GPU architectures. To provide essential AFA functionalities on a CPU-bypass I/O path, GNStor further designs \emph{deEngine}, a decentralized AFA engine that decomposes AFA-level tasks and integrates them into the firmware of each SSD, thereby achieving efficient AFA access at low cost. Evaluation results show that GNStor achieves 3.2$\times$ higher I/O throughput and reduces application execution time by 31.1\% compared to state-of-the-art AFA systems.
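To make the idea of atomic-operation-based I/O orchestration concrete, the following is a minimal host-side C++ sketch, not GNoR's actual implementation. It assumes a fixed-size submission queue in which each caller reserves a distinct slot with a single atomic fetch-and-add and then publishes the filled slot with a release store. The names `IoCommand`, `SubmissionQueue`, `submit`, and `published`, as well as the command fields, are hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical I/O command; the fields are illustrative, not GNoR's format.
struct IoCommand {
    uint64_t lba;         // starting logical block address
    uint32_t num_blocks;  // number of blocks to transfer
    uint32_t issuer;      // id of the thread that filled the slot
};

// Assumed design: a fixed-size queue without wraparound, for brevity.
class SubmissionQueue {
public:
    explicit SubmissionQueue(std::size_t capacity)
        : slots_(capacity), ready_(capacity), tail_(0) {
        assert(capacity > 0);
        for (auto& flag : ready_) flag.store(false, std::memory_order_relaxed);
    }

    // Reserve a unique slot with one atomic fetch-and-add, fill it, then
    // publish it. Returns false once every slot has been handed out.
    bool submit(const IoCommand& cmd) {
        std::size_t idx = tail_.fetch_add(1, std::memory_order_relaxed);
        if (idx >= slots_.size()) return false;  // queue exhausted
        slots_[idx] = cmd;                       // no other caller owns idx
        ready_[idx].store(true, std::memory_order_release);
        return true;
    }

    // Count the slots whose contents have been published.
    std::size_t published() const {
        std::size_t count = 0;
        for (const auto& flag : ready_) {
            if (flag.load(std::memory_order_acquire)) ++count;
        }
        return count;
    }

private:
    std::vector<IoCommand> slots_;
    std::vector<std::atomic<bool>> ready_;
    std::atomic<std::size_t> tail_;
};
```

Because the fetch-and-add hands each caller a different index, concurrent submitters never write the same slot and need no lock. On a GPU, the analogous step would presumably use a device-side atomic, with many threads reserving slots in parallel. How GNoR handles queue wraparound, completion, and doorbell signaling is not described in the abstract and is not modeled here.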
