This paper studies the computational offloading of video action recognition in edge computing. To extract and compress semantic information effectively, we follow the semantic communication paradigm and propose a novel spatiotemporal attention-based autoencoder (STAE) architecture. It comprises a frame attention module and a spatial attention module, which evaluate the importance of each frame and of the pixels within each frame, respectively. We additionally apply entropy encoding to remove statistical redundancy from the compressed data, further reducing communication overhead. At the receiver, we develop a lightweight decoder based on a combined 3D-2D CNN architecture that reconstructs missing information by jointly learning temporal and spatial information from the received data, thereby improving accuracy. To accelerate convergence, we train the resulting STAE-based vision transformer (ViT_STAE) models with a step-by-step approach. Experimental results show that ViT_STAE can compress the video dataset HMDB51 by 104x with only 5% accuracy loss, outperforming the state-of-the-art baseline DeepISC. Under time-varying wireless channels, the proposed ViT_STAE achieves faster inference and higher accuracy than the DeepISC-based ViT model, which highlights the effectiveness of STAE in maintaining higher accuracy under time constraints.
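To make the compression pipeline described above more concrete, the following is a minimal sketch of two of its ideas: keeping only the most important frames, and estimating how many bits an entropy coder would need for the quantized data. It is not the STAE implementation. The function names (`select_frames`, `quantize`, `entropy_bits`), the softmax weighting, the uniform quantizer, and the hard-coded scores are all assumptions made for illustration; in the paper the importance scores come from learned attention modules.

```python
import math
from collections import Counter


def softmax(scores):
    """Turn raw importance scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_frames(frame_scores, keep):
    """Return indices of the `keep` highest-weighted frames, in temporal order.

    `frame_scores` is a hypothetical stand-in for the output of a learned
    frame attention module.
    """
    weights = softmax(frame_scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:keep])


def quantize(values, levels=8):
    """Uniformly quantize values in [0, 1] to integer symbols 0..levels-1."""
    return [min(levels - 1, int(v * levels)) for v in values]


def entropy_bits(symbols):
    """Shannon lower bound, in bits, for losslessly coding the whole sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())
```

For example, `select_frames([0.1, 2.0, 0.3, 1.5], keep=2)` returns `[1, 3]`, so only two of the four frames would be transmitted. For the symbol sequence `[0, 0, 0, 1]`, `entropy_bits` gives roughly 3.25 bits, compared with 12 bits for a fixed-length code at 8 quantization levels. This gap is the statistical redundancy that entropy encoding removes.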