Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing the cost of adaptive inference applications. However, prior work often overlooks real-world factors such as Virtual Machine (VM) cold starts and requests with long-tail service time distributions. To address these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, where each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier at which requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling them to handle bursts of requests reaching deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over $23\%$ while efficiently adapting to dynamic workloads.