An increasing number of applications rely on complex inference tasks based on machine learning (ML). Currently, there are two options for running such tasks: they are either served directly by the end device (e.g., a smartphone, IoT equipment, or a smart vehicle) or offloaded to a remote cloud. Both options may be unsatisfactory for many applications: local models may have inadequate accuracy, while the cloud may fail to meet delay constraints. In this paper, we present the novel idea of inference delivery networks (IDNs), networks of computing nodes that coordinate to satisfy ML inference requests while achieving the best trade-off between latency and accuracy. IDNs bridge the dichotomy between device and cloud execution by integrating inference delivery at the various tiers of the infrastructure continuum (access, edge, regional data center, cloud). We propose a distributed dynamic policy for ML model allocation in an IDN, by which each node dynamically updates its local set of inference models based on requests observed in the recent past and on limited information exchange with its neighboring nodes. Our policy offers strong performance guarantees in an adversarial setting and shows improvements over greedy heuristics of similar complexity in realistic scenarios.