As an increasing number of businesses become powered by machine learning, inference becomes a core operation and is increasingly offered as a service. In this context, the inference task must meet certain service-level objectives (SLOs), such as high throughput and low latency. However, these targets can be compromised by interference from long- or short-lived co-located tasks. Prior work focuses on the generic problem of co-scheduling to mitigate the effect of interference on the performance-critical task. In this work, we focus on inference pipelines and propose ODIN, a technique that mitigates the effect of interference on inference performance through online scheduling of the pipeline stages. ODIN detects interference online and automatically rebalances the pipeline stages to limit the resulting performance degradation. We demonstrate that ODIN successfully mitigates the effect of interference, sustaining the latency and throughput of CNN inference, and that it outperforms least-loaded scheduling (LLS), a common technique for interference mitigation. It is also effective in maintaining SLOs for inference and scales to large network models executing on multiple processing elements.
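To make the detect-and-rebalance idea concrete, the following is a minimal sketch and not ODIN's actual algorithm, which the abstract does not specify. It assumes a pipeline whose layers are split into contiguous stages, one stage per processing element, with throughput bounded by the slowest stage. The per-layer costs, the per-element `slowdown` factors, the detection `threshold`, and all function names are hypothetical and chosen only for illustration.

```python
from typing import List


def stage_times(layer_costs: List[float], bounds: List[int],
                slowdown: List[float]) -> List[float]:
    """Stage i runs layers bounds[i]:bounds[i+1] on an element slowed by slowdown[i]."""
    return [sum(layer_costs[bounds[i]:bounds[i + 1]]) * slowdown[i]
            for i in range(len(slowdown))]


def detect_interference(measured: List[float], baseline: List[float],
                        threshold: float = 1.2) -> List[int]:
    """Return the stages whose measured time exceeds baseline by the (assumed) threshold."""
    return [i for i, (m, b) in enumerate(zip(measured, baseline))
            if m > threshold * b]


def rebalance(layer_costs: List[float], bounds: List[int],
              slowdown: List[float]) -> List[int]:
    """Greedy sketch: shift one boundary layer out of the bottleneck stage
    to a neighbouring stage for as long as that lowers the bottleneck time."""
    bounds = list(bounds)
    while True:
        times = stage_times(layer_costs, bounds, slowdown)
        worst = max(range(len(times)), key=times.__getitem__)
        best = None
        # Move the left boundary right (+1) or the right boundary left (-1).
        for edge, delta in ((worst, +1), (worst + 1, -1)):
            if edge == 0 or edge == len(bounds) - 1:
                continue  # the outer boundaries of the pipeline are fixed
            cand = list(bounds)
            cand[edge] += delta
            if cand[worst + 1] - cand[worst] < 1:
                continue  # keep at least one layer in every stage
            t = max(stage_times(layer_costs, cand, slowdown))
            if t < times[worst] and (best is None or t < best[0]):
                best = (t, cand)
        if best is None:
            return bounds
        bounds = best[1]


# Hypothetical example: 8 equal-cost layers on 4 processing elements.
costs = [1.0] * 8
bounds = [0, 2, 4, 6, 8]
baseline = stage_times(costs, bounds, [1.0, 1.0, 1.0, 1.0])

# A co-located task slows the second element by an assumed factor of 3.
slow = [1.0, 3.0, 1.0, 1.0]
measured = stage_times(costs, bounds, slow)

if detect_interference(measured, baseline):
    bounds = rebalance(costs, bounds, slow)
print(bounds, max(stage_times(costs, bounds, slow)))
```

In this toy setting, rebalancing moves a layer off the slowed element and so shortens the bottleneck stage. A real system would measure stage times at run time instead of modelling them with fixed slowdown factors.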