On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance

Deciding what combination of operators to use across the Edge AI tiers to achieve specific latency and model performance requirements is an open question for MLOps engineers. This study empirically assesses the accuracy vs. inference time trade-off of different black-box Edge AI deployment strategies, i.e., combinations of deployment operators and deployment tiers. We conduct inference experiments involving three deployment operators (Partitioning, Quantization, Early Exit), three deployment tiers (Mobile, Edge, Cloud), and their combinations on four widely used computer vision models to identify the optimal strategies from the point of view of MLOps engineers. Our findings suggest that Edge deployment using the hybrid Quantization + Early Exit operator could be preferred over non-hybrid operators (Quantization or Early Exit on Edge, Partitioning on Mobile-Edge) when lower latency is the priority and a medium accuracy loss is acceptable. However, when minimizing accuracy loss is the priority, MLOps engineers should prefer using only the Quantization operator on Edge, which reduces latency relative to the Early Exit and Partitioning operators (on Edge and Mobile-Edge, respectively) but increases it relative to the Quantized Early Exit operator (on Edge). In scenarios constrained by Mobile CPU/RAM resources, Partitioning across the Mobile and Edge tiers is preferable to Mobile-only deployment. For models with smaller input data samples (such as FCN), a network-constrained Cloud deployment can also be a better alternative than Mobile/Edge deployment and Partitioning strategies. For models with large input data samples (ResNet, ResNext, DUC), an Edge tier with higher network/computational capabilities than Cloud/Mobile can be a more viable option than Partitioning and Mobile/Cloud deployment strategies.
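The accuracy vs. latency trade-off described above can be pictured as choosing among measured strategies under an accuracy-loss budget. The following is only a minimal sketch of that selection logic, not the study's method: the `Measurement` type, the `pareto_front` and `pick` helpers, the "No operator on Edge" baseline label, and all numbers are hypothetical placeholders that do not come from the experiments.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Measurement:
    strategy: str      # operator + tier label (hypothetical naming)
    latency_ms: float  # measured inference latency, lower is better
    accuracy: float    # model accuracy in [0, 1], higher is better


def pareto_front(measurements: List[Measurement]) -> List[Measurement]:
    """Keep strategies that no other strategy beats on both latency and accuracy."""
    front = []
    for m in measurements:
        dominated = any(
            o.latency_ms <= m.latency_ms
            and o.accuracy >= m.accuracy
            and (o.latency_ms < m.latency_ms or o.accuracy > m.accuracy)
            for o in measurements
        )
        if not dominated:
            front.append(m)
    # Fastest strategies first.
    return sorted(front, key=lambda m: m.latency_ms)


def pick(
    measurements: List[Measurement],
    baseline_accuracy: float,
    max_accuracy_loss: float,
) -> Optional[Measurement]:
    """Return the fastest non-dominated strategy within the accuracy-loss budget."""
    for m in pareto_front(measurements):
        if baseline_accuracy - m.accuracy <= max_accuracy_loss:
            return m
    return None


# Placeholder values for illustration only; NOT results from the study.
measurements = [
    Measurement("No operator on Edge", 100.0, 0.80),
    Measurement("Quantization on Edge", 60.0, 0.79),
    Measurement("Early Exit on Edge", 70.0, 0.72),
    Measurement("Quantized Early Exit on Edge", 40.0, 0.71),
]

tight = pick(measurements, baseline_accuracy=0.80, max_accuracy_loss=0.02)
loose = pick(measurements, baseline_accuracy=0.80, max_accuracy_loss=0.10)
```

With these placeholder numbers, a tight accuracy budget selects the Quantization-only entry and a looser budget selects the hybrid entry, mirroring the shape of the recommendation in the abstract. Real use would substitute latency and accuracy values measured on the target tiers.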
