On the Impact of White-box Deployment Strategies for Edge AI on Latency and Model Performance

To help MLOps engineers decide which operator to use in which deployment scenario, this study empirically assesses the accuracy vs. latency trade-off of white-box (training-based) operators, black-box (non-training-based) operators, and their combinations in an Edge AI setup. We perform inference experiments with 3 white-box operators (QAT, Pruning, Knowledge Distillation), 2 black-box operators (Partition, SPTQ), and their combinations (Distilled SPTQ, SPTQ Partition) across 3 deployment tiers (Mobile, Edge, Cloud) on 4 commonly used Computer Vision and Natural Language Processing models, in order to identify effective strategies from the perspective of MLOps engineers. Our results indicate that the combination of the Distillation and SPTQ operators (DSPTQ) should be preferred over non-hybrid operators when lower latency is required at the Edge tier and a small to medium accuracy drop is acceptable. Among the non-hybrid operators, the Distilled operator is a better alternative in both the Mobile and Edge tiers, offering lower latency at the cost of a small to medium accuracy loss. Moreover, in the resource-constrained tiers (Mobile, Edge), operators involving Distillation show lower latency than operators involving Partitioning across the Mobile and Edge tiers. For textual subject models, which have low input data size requirements, the Cloud tier is a better deployment choice than the Mobile, Edge, or Mobile-Edge tier (the latter being used for operators involving Partitioning). In contrast, for image-based subject models, which have high input data size requirements, the Edge tier is a better deployment choice than the Mobile tier or the Mobile-Edge tier.
