AI training and inference impose sustained, fine-grained I/O that stresses host-mediated, TCP-based storage paths. Motivated by kernel-bypass networking and user-space storage stacks, we revisit POSIX-compatible object storage for GPU-centric pipelines. We present ROS2, an RDMA-first object storage system design that offloads the DAOS client to an NVIDIA BlueField-3 SmartNIC while leaving the DAOS I/O engine unchanged on the storage server. ROS2 separates a lightweight control plane (gRPC for namespace and capability exchange) from a high-throughput data plane (UCX/libfabric over RDMA or TCP) and removes host mediation from the data path. Using FIO/DFS across local and remote configurations, we find that on server-grade CPUs RDMA consistently outperforms TCP for both large sequential and small random I/O. When the RDMA-driven DAOS client is offloaded to BlueField-3, end-to-end performance is comparable to the host, demonstrating that SmartNIC offload preserves RDMA efficiency while enabling DPU-resident features such as multi-tenant isolation and inline services (e.g., encryption/decryption) close to the NIC. In contrast, TCP on the SmartNIC lags host performance, underscoring the importance of RDMA for offloaded deployments. Overall, our results indicate that an RDMA-first, SmartNIC-offloaded object-storage stack is a practical foundation for scaling data delivery in modern LLM training environments; integrating optional GPU-direct placement for LLM tasks is left for future work.