The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks (DNNs) on the client side. However, critical applications require powerful models that edge devices cannot host, so they must offload requests, and the high-dimensional data then competes for limited bandwidth. This work proposes shifting the focus away from executing the shallow layers of partitioned DNNs. Instead, it advocates concentrating local resources on variational compression optimized for machine interpretability. We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an environment that reflects the asymmetric resource distribution between edge devices and servers. Our method achieves a 60% lower bitrate than a state-of-the-art split computing (SC) method without decreasing accuracy and is up to 16x faster than offloading with existing codec standards.