Distributed inference techniques can be broadly classified into data-distributed and model-distributed schemes. In data-distributed inference (DDI), each worker carries the entire machine learning (ML) model but processes only a subset of the data. However, delivering the data to workers incurs high communication costs, especially when the data is large. An emerging paradigm is model-distributed inference (MDI), where each worker carries only a subset of the ML layers. In MDI, a source device that holds the data processes the first few layers of the ML model and sends the output to a neighboring device, i.e., offloads the remaining layers. This process continues until all layers are processed in a distributed manner. In this paper, we investigate the design and development of MDI when multiple data sources coexist. We consider that each data source has a different importance and, hence, a priority. We formulate and solve a priority-aware model allocation optimization problem. Based on the structure of the optimal solution, we design a practical Priority-Aware Model-Distributed Inference (PA-MDI) algorithm that determines model allocation and distribution over devices by taking into account the priorities of different sources. Experiments were conducted on a real-life testbed of NVIDIA Jetson Xavier and Nano edge devices, as well as on the Colosseum testbed, with ResNet-50, ResNet-56, and GPT-2 models. The experimental results show that PA-MDI performs priority-aware model allocation successfully while reducing inference time compared to baselines.