Cloud-native systems increasingly rely on infrastructure services (e.g., service meshes, monitoring agents), which compete for resources with user applications and degrade their performance and scalability. We propose HeteroPod, a new abstraction that offloads these services to Data Processing Units (DPUs) to enforce strict isolation while reducing host resource contention and operational costs. To realize HeteroPod, we introduce HeteroNet, a cross-PU (XPU) network system featuring: (1) split network namespace, a unified network abstraction for processes spanning the CPU and DPU, and (2) elastic and efficient XPU networking, a communication mechanism that achieves shared-memory performance without pinned-resource overhead or polling costs. By leveraging HeteroNet and the compositional nature of cloud-native workloads, HeteroPod can optimally offload infrastructure containers to DPUs. We implement HeteroNet on Linux and build a cloud-native system called HeteroK8s on Kubernetes. We evaluate both systems using NVIDIA BlueField-2 DPUs and CXL-based DPUs (simulated with real CXL memory devices). The results show that HeteroK8s effectively supports complex, unmodified commodity cloud-native applications (up to 1 million LoC) and provides up to 31.9x better latency and 64x lower resource consumption than a kernel-bypass design, as well as 60% better end-to-end latency and 55% higher scalability than state-of-the-art systems.
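The combination the abstract describes — data moving through shared memory while the receiver sleeps until notified, rather than busy-polling or pinning dedicated cores — can be illustrated with ordinary POSIX primitives. The sketch below is not HeteroNet's actual mechanism; it is a minimal single-slot analogue, assuming Linux semantics (`fork` start method), where a pipe plays the role of a doorbell interrupt and a fork-inherited shared value plays the role of the XPU shared-memory channel.

```python
import multiprocessing as mp
import os

# Assumption: POSIX fork, so pipes and anonymous shared memory are inherited.
mp.set_start_method('fork', force=True)

slot = mp.RawValue('i', 0)   # single-slot "shared-memory channel" (illustrative)
db_r, db_w = os.pipe()       # doorbell: producer -> consumer notification
ack_r, ack_w = os.pipe()     # ack: consumer -> producer flow control
res_r, res_w = os.pipe()     # carries the final sum back to the parent

def consumer(n):
    total = 0
    for _ in range(n):
        os.read(db_r, 1)          # block until notified -- no CPU-burning poll loop
        total += slot.value       # the payload itself travels via shared memory
        os.write(ack_w, b'\x01')  # tell the producer the slot may be reused
    os.write(res_w, total.to_bytes(4, 'little'))

p = mp.Process(target=consumer, args=(5,))
p.start()
for v in (1, 2, 3, 4, 5):
    slot.value = v                # write payload into shared memory
    os.write(db_w, b'\x01')       # ring the doorbell
    os.read(ack_r, 1)             # wait for the ack before overwriting the slot
p.join()
total = int.from_bytes(os.read(res_r, 4), 'little')
print(total)  # -> 15
```

The point of the sketch is the split of roles: bulk data never crosses the notification path, and the notification path consumes no CPU while idle, which is the property the abstract contrasts with polling-based kernel-bypass designs.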