Optimizing multi-model inference pipelines so that inference is fast, accurate, and cost-effective is a crucial challenge in ML production systems, given their tight end-to-end latency requirements. The trade-off space between accuracy and cost in inference pipelines is vast and intricate, so providers frequently simplify its exploration by optimizing for only one of the two. The real challenge, however, lies in reconciling accuracy and cost. To address it, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task that vary in resource requirements, latency, and accuracy. Using Integer Programming, IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize cost, and meet user-defined latency SLAs. It supports multi-objective settings for achieving different trade-offs between accuracy and cost while remaining adaptable to varying workloads and dynamic traffic patterns. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves normalized accuracy by up to 35% with a minimal cost increase of less than 5%.
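To make the configuration problem concrete, the following is a minimal sketch of the kind of decision IPA is described as making: choosing a model variant, batch size, and replica count per pipeline stage under a latency SLA, with weights trading accuracy against cost. It is not IPA's actual formulation. It enumerates all combinations by brute force instead of calling an Integer Programming solver, ignores queueing delay, sums per-stage accuracies as the pipeline accuracy, and derives the replica count as the minimum needed to sustain the arrival rate. All names (`Variant`, `best_config`, `PIPELINE`) and all numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil


@dataclass
class Variant:
    """One model variant for a stage (all values are illustrative)."""
    name: str
    accuracy: float    # normalized accuracy in [0, 1]
    cores: int         # CPU cores per replica (used as the cost unit)
    latency_ms: dict   # batch size -> processing latency of one batch (ms)


def best_config(pipeline, rate_rps, sla_ms, alpha, beta):
    """Return [(variant_name, batch_size, replicas), ...] maximizing
    alpha * accuracy - beta * cost subject to the latency SLA,
    or None if no combination meets the SLA."""
    # Every (variant, batch size) pair is one option for a stage.
    options = [[(v, b) for v in stage for b in v.latency_ms]
               for stage in pipeline]
    best, best_score = None, float("-inf")
    for combo in product(*options):
        # Assumption: end-to-end latency is the sum of per-stage
        # batch processing latencies (queueing delay is ignored).
        if sum(v.latency_ms[b] for v, b in combo) > sla_ms:
            continue
        plan, cost = [], 0
        for v, b in combo:
            # One replica serves b requests every latency_ms milliseconds,
            # so sustaining rate_rps needs this many replicas.
            replicas = ceil(rate_rps * v.latency_ms[b] / (1000 * b))
            cost += replicas * v.cores
            plan.append((v.name, b, replicas))
        accuracy = sum(v.accuracy for v, _ in combo)
        score = alpha * accuracy - beta * cost
        if score > best_score:
            best, best_score = plan, score
    return best


# A hypothetical two-stage pipeline: detection followed by classification.
PIPELINE = [
    [Variant("det-small", 0.60, 1, {1: 20, 4: 60}),
     Variant("det-large", 0.90, 2, {1: 50, 4: 160})],
    [Variant("cls-small", 0.70, 1, {1: 10, 4: 30}),
     Variant("cls-large", 0.95, 2, {1: 40, 4: 120})],
]

if __name__ == "__main__":
    # Accuracy-weighted versus cost-weighted objective at 50 requests/s.
    print(best_config(PIPELINE, rate_rps=50, sla_ms=100, alpha=10, beta=0.01))
    print(best_config(PIPELINE, rate_rps=50, sla_ms=100, alpha=0.1, beta=1))
```

With these made-up numbers, raising `alpha` steers the search toward the larger variants with more replicas, while raising `beta` steers it toward the smaller ones. Exhaustive enumeration grows exponentially with the number of stages, so it is only practical for toy pipelines like this one.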