The rapid growth of AI-generated content (AIGC) has enabled high-quality creative production across diverse domains, yet existing systems face critical inefficiencies in throughput, resource utilization, and scalability under concurrent workloads. This paper introduces OnePiece, a large-scale, RDMA-based distributed inference system optimized for multi-stage AIGC workflows. By decomposing pipelines into fine-grained microservices and leveraging one-sided RDMA communication, OnePiece significantly reduces inter-node latency and CPU overhead while improving GPU utilization. The system incorporates a novel double-ring buffer design that resolves deadlocks in RDMA-aware memory access without CPU involvement. Additionally, a dynamic Node Manager elastically allocates resources across workflow stages in response to real-time load. Experimental results demonstrate that OnePiece reduces GPU resource consumption by 16× on Wan2.1 image-to-video generation compared to monolithic inference pipelines, offering a scalable, fault-tolerant, and efficient solution for production AIGC environments.
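The double-ring buffer is only named in the abstract, so the following is purely an illustrative sketch, not the paper's actual design: one plausible reading pairs a data ring with a notification ring so that the producer and consumer each read and clear only flags in their own memory region, mirroring how one-sided RDMA writes publish a payload and a slot-ready flag without involving the remote CPU. All names and the ring layout here are assumptions, modeled as a minimal single-producer/single-consumer structure in plain Python:

```python
# Hypothetical sketch of a double-ring buffer in the spirit of the design
# described above. The two Python lists stand in for registered RDMA memory
# regions: one ring holds payloads, the second holds slot-ready flags, so a
# full ring simply fails the push instead of blocking (no deadlock, no
# remote-CPU handshake). Layout and names are illustrative assumptions.

class DoubleRingBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = [None] * capacity      # data ring ("remote-writable" region)
        self.ready = [False] * capacity    # notification ring (slot-ready flags)
        self.head = 0                      # next slot the producer writes
        self.tail = 0                      # next slot the consumer reads

    def push(self, item) -> bool:
        """Producer side: emulates a one-sided write of payload, then flag."""
        slot = self.head % self.capacity
        if self.ready[slot]:               # slot not yet drained -> ring is full
            return False
        self.data[slot] = item             # "RDMA write" of the payload
        self.ready[slot] = True            # "RDMA write" of the ready flag
        self.head += 1
        return True

    def pop(self):
        """Consumer side: polls only its local flag; producer CPU stays idle."""
        slot = self.tail % self.capacity
        if not self.ready[slot]:           # nothing published in this slot yet
            return None
        item = self.data[slot]
        self.ready[slot] = False           # clearing the flag recycles the slot
        self.tail += 1
        return item
```

Because `push` returns `False` on a full ring rather than spinning, neither endpoint can wait on the other indefinitely, which is the deadlock-freedom property the abstract attributes to the real design.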