The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks (DNNs) on the client side. However, critical applications require powerful models that edge devices cannot host, so these devices must offload requests, and the high-dimensional data then competes for limited bandwidth. This work proposes shifting the focus away from executing the shallow layers of partitioned DNNs. Instead, it advocates concentrating local resources on variational compression optimized for machine interpretability. We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an environment reflecting the asymmetric resource distribution between edge devices and servers. Our method achieves a 60% lower bitrate than a state-of-the-art Split Computing (SC) method without decreasing accuracy and is up to 16x faster than offloading with existing codec standards.