We propose a communication-efficient collaborative inference framework for edge inference, focusing on the efficient use of vision transformer (ViT) models. The partitioning strategy of conventional collaborative inference fails to reduce communication cost because the inherent architecture of ViTs maintains consistent layer dimensions across the entire transformer encoder. Therefore, instead of partitioning a single model, our framework deploys a lightweight ViT model on the edge device and a larger, more complex ViT model on the server. To improve communication efficiency while matching the classification accuracy of the server model, we propose two strategies: 1) attention-aware patch selection and 2) entropy-aware image transmission. Attention-aware patch selection leverages the attention scores generated by the edge device's transformer encoder to identify and select the image patches critical for classification. This strategy enables the edge device to transmit only the essential patches to the server, significantly improving communication efficiency. Entropy-aware image transmission uses min-entropy as a metric to determine whether to rely on the lightweight model on the edge device or to request inference from the server model. In our framework, the lightweight ViT model on the edge device acts as a semantic encoder, efficiently identifying and selecting the image information crucial for the classification task. Our experiments demonstrate that the proposed collaborative inference framework reduces communication overhead by 68% with only a minimal loss in accuracy compared to the server model on the ImageNet dataset.
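The two strategies above can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the top-k selection rule, and the threshold-based offloading policy are illustrative assumptions. It shows (a) selecting the k patches with the highest CLS-token attention scores and (b) computing min-entropy, H_min = -log2(max_i p_i), over the edge model's class probabilities to decide whether to offload to the server.

```python
import numpy as np

def select_patches(cls_attn: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k patches with the highest CLS-token
    attention scores (hypothetical selection rule)."""
    return np.argsort(cls_attn)[::-1][:k]

def min_entropy(probs: np.ndarray) -> float:
    """Min-entropy of the predicted class distribution:
    H_min = -log2(max_i p_i). Low when the model is confident."""
    return float(-np.log2(np.max(probs)))

def should_offload(probs: np.ndarray, threshold: float) -> bool:
    """Offload to the server when the edge model's prediction is
    too uncertain, i.e. when min-entropy exceeds the threshold
    (an assumed decision rule)."""
    return min_entropy(probs) > threshold

# Confident edge prediction: handle locally, send nothing.
edge_probs = np.array([0.9, 0.05, 0.05])
if should_offload(edge_probs, threshold=1.0):
    # Uncertain: transmit only the top-k attended patches.
    attn = np.array([0.1, 0.5, 0.3, 0.2])  # per-patch CLS attention
    keep = select_patches(attn, k=2)       # indices [1, 2]
```

Under this policy, confident inputs incur zero transmission cost, and uncertain inputs cost only k patches instead of the full image, which is the source of the communication savings.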