In this paper, we investigate how to push the performance limits of serving Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe that while intra-operator parallelism across multiple threads is an effective way to reduce inference latency, it provides diminishing returns as more threads are added. Our primary insight is that, instead of running a single instance of a model with all available threads on a server, running multiple instances, each with a smaller batch size and fewer threads for intra-op parallelism, can provide lower inference latency. However, the right configuration is hard to determine manually since it is both workload-dependent (the DNN model and batch size used by the serving system) and deployment-dependent (the number of CPU cores on the server). We present Packrat, a new serving system for online inference that, given a model and batch size ($B$), algorithmically picks the optimal number of instances ($i$), the number of threads allocated to each ($t$), and the batch size each should operate on ($b$) so as to minimize latency. Packrat is built as an extension to TorchServe and supports online reconfiguration to avoid serving downtime. Averaged across a range of batch sizes, Packrat improves inference latency by 1.43$\times$ to 1.83$\times$ on a range of commonly used DNNs.
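To make the configuration problem concrete, the following is a minimal sketch of choosing $(i, t, b)$ for a given $B$ and core count. It is not Packrat's algorithm, which the abstract does not specify: it is a plain exhaustive search restricted to uniform splits, where every instance gets the same thread count and batch size. The names `Config`, `toy_latency`, and `best_uniform_config` are hypothetical, and `toy_latency` is an invented stand-in for measured per-instance latency profiles.

```python
import math
from typing import Callable, NamedTuple, Optional


class Config(NamedTuple):
    instances: int   # i: number of model instances
    threads: int     # t: intra-op threads per instance
    batch: int       # b: batch size per instance
    latency: float   # estimated latency of one instance (all run in parallel)


def toy_latency(threads: int, batch: int) -> float:
    """Hypothetical latency model (an assumption, not measured data).

    Work grows with batch size, only part of it speeds up with more
    threads, and each extra thread adds a small synchronization cost,
    which yields diminishing returns from intra-op parallelism.
    """
    fixed = 1.0                         # per-invocation overhead
    serial = 0.2 * batch                # portion that does not parallelize
    parallel = 0.8 * batch / threads    # portion that scales with threads
    sync = 0.05 * threads               # per-thread coordination cost
    return fixed + serial + parallel + sync


def best_uniform_config(
    total_threads: int,
    total_batch: int,
    latency_fn: Callable[[int, int], float],
) -> Config:
    """Try every instance count i and keep the lowest estimated latency.

    Each candidate gives every instance t = total_threads // i threads
    and b = ceil(total_batch / i) items, so the instances together
    cover the whole batch without oversubscribing the cores.
    """
    best: Optional[Config] = None
    for i in range(1, min(total_threads, total_batch) + 1):
        t = total_threads // i
        b = math.ceil(total_batch / i)
        estimate = latency_fn(t, b)
        if best is None or estimate < best.latency:
            best = Config(i, t, b, estimate)
    assert best is not None  # the loop runs at least once for positive inputs
    return best


if __name__ == "__main__":
    # Example: a 16-core server serving a batch of B = 32.
    print(best_uniform_config(total_threads=16, total_batch=32,
                              latency_fn=toy_latency))
```

In a real deployment, `latency_fn` would be replaced by profiled latencies for each (threads, batch) pair on the target server, and non-uniform splits would also need to be considered.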