The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, built on traditional model architectures (e.g., CNNs and RNNs), employ various strategies to filter out non-region-of-interest content to reduce bandwidth and computation consumption, but they perform poorly in adverse environments. Recently, transformer-based visual foundation models have shown strong performance in adverse environments owing to their generalization capability. However, they demand substantial computation power, which limits their use in real-time intelligent video analytics. In this paper, we show that visual foundation models such as the Vision Transformer (ViT) also admit a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. Arena exploits the fact that ViT can be accelerated through token pruning: only Patches-of-Interest are offloaded and fed to the downstream models. Additionally, we design an adaptive keyframe inference switching algorithm, tailored to different videos, that adapts to the current video content to jointly optimize accuracy and bandwidth. Extensive experiments reveal that Arena boosts inference speeds by up to 1.58\(\times\) and 1.82\(\times\) on average while consuming only 47\% and 31\% of the bandwidth, respectively, all with high inference accuracy.
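To make the Patches-of-Interest idea concrete, the following is a minimal sketch of ViT-style token pruning. All names are hypothetical and NumPy arrays stand in for real patch embeddings: given a boolean mask marking Patches-of-Interest, only those tokens (plus the class token) would be retained and fed to the downstream transformer layers, shrinking the sequence length and thus the compute.

```python
import numpy as np

def prune_tokens(patch_tokens, roi_mask, cls_token):
    """Keep only Patches-of-Interest (hypothetical helper, not Arena's actual code).

    patch_tokens: (N, D) array of patch embeddings
    roi_mask:     (N,) boolean array, True for Patches-of-Interest
    cls_token:    (D,) class-token embedding, always kept
    """
    kept = patch_tokens[roi_mask]                         # drop non-ROI tokens
    return np.concatenate([cls_token[None, :], kept], 0)  # CLS + kept patches

# Toy example: 16 patches of dimension 8, 5 marked as Patches-of-Interest.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
cls = rng.normal(size=(8,))
mask = np.zeros(16, dtype=bool)
mask[[2, 3, 6, 7, 10]] = True

pruned = prune_tokens(tokens, mask, cls)
print(pruned.shape)  # (6, 8): CLS token + 5 kept patch tokens
```

Because self-attention cost grows quadratically with sequence length, shrinking 16 tokens to 6 in this toy case would cut attention compute far more than proportionally; the paper's reported speedups arise from this effect at realistic scales.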