The rapid adoption of machine learning (ML) has underscored the importance of serving ML models with high throughput and resource efficiency. Traditional approaches to handling growing query demand have focused predominantly on hardware scaling, that is, increasing the server count or computing power. However, this strategy is often impractical because of limits on the available budget or compute resources. As an alternative, accuracy scaling adjusts the accuracy of ML models to accommodate fluctuating query demand. Yet existing accuracy scaling techniques target independent ML models and tend to underperform when managing inference pipelines. Furthermore, they are not integrated with hardware scaling, which can lead to resource inefficiency during low-demand periods. To address these limitations, this paper introduces Loki, a system designed to serve inference pipelines effectively with both hardware and accuracy scaling. Loki incorporates a theoretical framework for optimal resource allocation and a query routing algorithm, aimed at improving system accuracy and minimizing latency deadline violations. Our empirical evaluation demonstrates that, through accuracy scaling, the effective capacity of a fixed-size cluster can be increased by more than $2.7\times$ compared to relying solely on hardware scaling. Compared with state-of-the-art inference-serving systems, Loki achieves up to a $10\times$ reduction in Service Level Objective (SLO) violations, with minimal compromise on accuracy while meeting throughput demands.