Modeless ML inference is growing in popularity because it hides the complexity of model inference from users and caters to the diverse accuracy requirements of users and applications. Previous work focuses mostly on modeless inference in data centers (DCs). To provide low-latency inference, this paper promotes modeless inference at the edge. The edge environment introduces additional challenges: low power budgets, limited device memory, and volatile network conditions. To address these challenges, we propose HawkVision, which provides low-latency modeless serving of vision DNNs. HawkVision leverages a two-layer edge-DC architecture that employs confidence scaling to reduce the number of model options while still meeting diverse accuracy requirements. It also supports lossy inference under volatile network conditions. Our experimental results show that HawkVision outperforms current serving systems by up to 1.6X in P99 latency when providing modeless service. Our FPGA prototype demonstrates similar performance at certain accuracy levels with up to a 3.34X reduction in power consumption.
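As a rough illustration of how a two-layer edge-DC design might use model confidence, the sketch below routes a request based on the edge model's top softmax probability. This is only an assumed reading of "confidence scaling", not HawkVision's actual mechanism: the function `route`, the `threshold` parameter, and the `dc_model` callback are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple


def route(
    edge_probs: List[float],
    threshold: float,
    dc_model: Callable[[], List[float]],
) -> Tuple[int, str]:
    """Return (predicted class index, tier that served the request).

    Assumption: a stricter accuracy requirement maps to a higher
    confidence threshold, so more requests fall through to the DC.
    """
    confidence = max(edge_probs)
    if confidence >= threshold:
        # The edge model is confident enough: answer locally.
        return edge_probs.index(confidence), "edge"
    # Otherwise fall back to the (hypothetical) larger model in the DC.
    dc_probs = dc_model()
    return dc_probs.index(max(dc_probs)), "dc"


if __name__ == "__main__":
    # Stand-in for a remote call to a data-center model.
    fake_dc = lambda: [0.95, 0.05]
    print(route([0.1, 0.9], threshold=0.8, dc_model=fake_dc))
    print(route([0.4, 0.6], threshold=0.8, dc_model=fake_dc))
```

Under this assumption, a single edge model with an adjustable threshold can stand in for several accuracy tiers, which is one way the number of model options could be reduced.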