Cloud platforms today deploy hardware accelerators such as neural processing units (NPUs) to power machine learning (ML) inference services. To maximize resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing among multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is due not only to the lack of system abstraction support for NPU hardware, but also to the lack of architectural and ISA support for fine-grained dynamic operator scheduling on virtualized NPUs. We present TCloud, a holistic NPU virtualization framework, and investigate virtualization techniques for NPUs across the entire software and hardware stack. TCloud consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that supports a pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; and (3) an ISA extension of modern NPU architectures that facilitates fine-grained tensor operator scheduling across multiple vNPUs. We implement TCloud on a production-level NPU simulator. Our experiments show that, compared to state-of-the-art NPU sharing approaches, TCloud improves the throughput of ML inference services by up to 1.4$\times$ and reduces tail latency by up to 4.6$\times$, while improving NPU utilization by 1.2$\times$ on average.