The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architectures (e.g., CNNs and RNNs), employ various strategies to filter out non-region-of-interest content and thereby reduce bandwidth and computation consumption, but they perform poorly in adverse environments. Recently, transformer-based visual foundation models have performed well in adverse environments thanks to their strong generalization capability. However, they require a large amount of computation power, which limits their application in real-time intelligent video analytics. In this paper, we observe that visual foundation models such as the Vision Transformer (ViT) also admit a dedicated acceleration mechanism for video analytics. Building on this observation, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. Arena exploits the fact that ViT inference can be accelerated through token pruning: only Patches-of-Interest (PoIs) are offloaded and fed to the downstream models. Additionally, we employ probability-based patch sampling, a simple but efficient mechanism for determining PoIs, i.e., the probable locations of objects in subsequent frames. Extensive evaluations on public datasets show that Arena boosts inference speed by up to $1.58\times$ and $1.82\times$ on average while consuming only 54% and 34% of the bandwidth, respectively, and maintains high inference accuracy.
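To make the idea of PoI selection concrete, the following is a minimal illustrative sketch, not Arena's actual implementation. The abstract does not specify how patch probabilities are computed, so the rule used here is an assumption: patches covered by a bounding box from the previous frame get a high keep-probability, neighboring patches a medium one, and all others a low floor. The function names (`boxes_to_probs`, `sample_pois`, `prune_tokens`) and the default values are hypothetical.

```python
import random


def boxes_to_probs(boxes, rows, cols, patch, inside=1.0, border=0.5, floor=0.05):
    """Assign each patch a keep-probability from last frame's boxes.

    boxes: list of (x0, y0, x1, y1) in pixels, x1/y1 exclusive.
    The inside/border/floor rule is an assumption made for illustration.
    """
    probs = [floor] * (rows * cols)
    for x0, y0, x1, y1 in boxes:
        c0, c1 = x0 // patch, (x1 - 1) // patch
        r0, r1 = y0 // patch, (y1 - 1) // patch
        # Visit the covered patches plus a one-patch border around them.
        for r in range(max(r0 - 1, 0), min(r1 + 1, rows - 1) + 1):
            for c in range(max(c0 - 1, 0), min(c1 + 1, cols - 1) + 1):
                covered = r0 <= r <= r1 and c0 <= c <= c1
                p = inside if covered else border
                idx = r * cols + c
                probs[idx] = max(probs[idx], p)
    return probs


def sample_pois(probs, rng):
    """Keep patch i with probability probs[i] (independent Bernoulli draws)."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


def prune_tokens(tokens, keep):
    """Token pruning: retain only the tokens of the selected patches."""
    return [tokens[i] for i in keep]


if __name__ == "__main__":
    rows, cols, patch = 4, 4, 16          # a 64x64 frame split into 16x16 patches
    tokens = [f"patch_{i}" for i in range(rows * cols)]  # stand-ins for patch embeddings
    probs = boxes_to_probs([(16, 16, 32, 32)], rows, cols, patch)
    keep = sample_pois(probs, random.Random(0))
    print(len(prune_tokens(tokens, keep)), "of", len(tokens), "patches offloaded")
```

In this sketch, only the sampled patches would be transmitted to the edge server and passed to the ViT as tokens, so both bandwidth and computation scale with the number of PoIs rather than with the full frame.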