Real-time ML services are increasingly deployed on warehouse-scale infrastructure. As a result, reducing the latency and increasing the throughput of the deep neural network (DNN) inference applications that power those services has attracted attention from both academia and industry. A common solution is to leverage hardware accelerators such as GPUs. Two approaches are commonly used to improve the inference throughput of DNNs deployed on GPUs: Batching and Multi-Tenancy. Our preliminary experiments show that the effect of these approaches on throughput depends on the DNN architecture. Based on this observation, we design and implement DNNScaler, which aims to maximize the throughput of interactive AI-powered services while meeting their latency requirements. DNNScaler first identifies which approach (Batching or Multi-Tenancy) would yield the greater throughput improvement for a given DNN. It then adjusts that approach's control knob (batch size for Batching, number of co-located instances for Multi-Tenancy) to increase throughput while keeping latency within the requirement. We conduct an extensive set of experiments using well-known DNNs from a variety of domains, several popular datasets, and a cutting-edge GPU. The results indicate that DNNScaler can improve throughput by up to 14x (218% on average) compared with the previously proposed approach, while meeting the latency requirements of the services.
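The two-step idea described above (pick an approach, then tune its knob under a latency bound) can be illustrated with a minimal sketch. This is not DNNScaler's actual implementation, whose details the abstract does not give. The `measure` hook, the probe value, the linear search, and the synthetic `toy_measure` model are all assumptions made for illustration only.

```python
from typing import Callable, Tuple

# Hypothetical measurement hook (an assumption, not part of DNNScaler):
# given an approach name and a knob value, it returns
# (throughput in items/s, latency in ms).
Measure = Callable[[str, int], Tuple[float, float]]


def pick_approach(measure: Measure, probe: int = 8) -> str:
    """Return the approach that gives more throughput at knob == probe."""
    batching_tp, _ = measure("batching", probe)
    tenancy_tp, _ = measure("multi_tenancy", probe)
    return "batching" if batching_tp >= tenancy_tp else "multi_tenancy"


def tune_knob(measure: Measure, approach: str, slo_ms: float,
              max_knob: int = 128) -> int:
    """Raise the knob one step at a time while latency stays within slo_ms.

    Returns the largest knob value that met the bound. If even knob == 1
    violates the bound, 1 is returned as the minimum possible setting.
    """
    best = 1
    for knob in range(1, max_knob + 1):
        _, latency_ms = measure(approach, knob)
        if latency_ms > slo_ms:
            break
        best = knob
    return best


def toy_measure(approach: str, knob: int) -> Tuple[float, float]:
    """Synthetic stand-in for real GPU measurements (illustrative only)."""
    if approach == "batching":
        latency_ms = 5.0 + 1.0 * knob   # fixed overhead plus per-item cost
    else:
        latency_ms = 6.0 * knob         # co-located instances contend fully
    throughput = knob * 1000.0 / latency_ms
    return throughput, latency_ms


if __name__ == "__main__":
    approach = pick_approach(toy_measure)
    knob = tune_knob(toy_measure, approach, slo_ms=20.0)
    print(approach, knob)
```

In this toy model, Batching amortizes a fixed overhead, so it wins the probe. A real system would replace `toy_measure` with profiling on the target GPU and would likely use a faster search than the linear scan shown here.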