The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks (DNNs) on the client side. However, critical applications require powerful models that edge devices cannot host, so their requests must be offloaded, and the high-dimensional data then competes for limited bandwidth. This work proposes a shift away from executing the shallow layers of partitioned DNNs on the device. Instead, it advocates concentrating local resources on variational compression optimized for machine interpretability. We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an environment reflecting the asymmetric resource distribution between edge devices and servers. Our method achieves a 60% lower bitrate than a state-of-the-art split computing (SC) method without decreasing accuracy and is up to 16x faster than offloading with existing codec standards.