Edge-assisted mobile video analytics (MVA) applications are increasingly shifting from vision models based on convolutional neural networks (CNNs) to those built on vision transformers (ViTs), leveraging their superior global context modeling and generalization. However, deploying these models in latency-critical MVA scenarios is challenging. Unlike traditional CNN-based offloading paradigms, where network transmission is the primary bottleneck, ViT-based systems are constrained by substantial inference delays, particularly for dense prediction tasks: the high-resolution inputs these tasks require inflate the token count, and ViT self-attention scales quadratically with it. To address these challenges, we propose a dynamic mixed-resolution inference strategy tailored to ViT-backbone dense prediction models, enabling flexible runtime trade-offs between speed and accuracy. Building on this, we introduce ViTMAlis, a ViT-native device-to-edge offloading framework that dynamically adapts to network conditions and video content to jointly reduce transmission and inference delays. We implement a fully functional prototype of ViTMAlis on commodity mobile and edge devices. Extensive experiments show that, compared with state-of-the-art accuracy-centric, content-aware, and latency-adaptive baselines, ViTMAlis significantly reduces end-to-end offloading latency while improving user-perceived rendering accuracy, providing a practical foundation for next-generation mobile intelligence.
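To make the mixed-resolution idea concrete, the following is a minimal sketch, not the paper's implementation: salient regions of a frame are patchified at fine granularity while background regions are coarsened, so the total token count fed to the ViT shrinks, and with it the quadratic self-attention cost. The function name, patch sizes, and saliency threshold are illustrative assumptions.

```python
# Minimal sketch of mixed-resolution tokenization for a ViT (illustrative,
# not the ViTMAlis implementation). Salient 32x32 blocks yield four 16x16
# fine patches; background blocks are downsampled into a single token.
import torch
import torch.nn.functional as F

def mixed_resolution_tokens(img, saliency, fine=16, coarse=32, thresh=0.5):
    """img: (C, H, W); saliency: (H//coarse, W//coarse) scores in [0, 1].
    Returns (num_tokens, C*fine*fine); num_tokens varies with content."""
    C, H, W = img.shape
    tokens = []
    for i in range(0, H, coarse):
        for j in range(0, W, coarse):
            block = img[:, i:i + coarse, j:j + coarse]
            if saliency[i // coarse, j // coarse] > thresh:
                # Salient block: keep full detail as four fine patches.
                p = block.unfold(1, fine, fine).unfold(2, fine, fine)
                p = p.reshape(C, -1, fine, fine).permute(1, 0, 2, 3)
                tokens.append(p.reshape(-1, C * fine * fine))
            else:
                # Background block: downsample to one fine-sized patch so
                # every token shares the same embedding dimension.
                down = F.interpolate(block.unsqueeze(0), size=(fine, fine),
                                     mode="bilinear", align_corners=False)
                tokens.append(down.reshape(1, C * fine * fine))
    return torch.cat(tokens, dim=0)

# Toy usage: a 224x224 frame where only the center 3x3 blocks are salient.
img = torch.randn(3, 224, 224)
sal = torch.zeros(7, 7)
sal[2:5, 2:5] = 1.0
toks = mixed_resolution_tokens(img, sal)
print(toks.shape)  # 76 tokens here vs. 196 for uniform 16x16 patchification
```

Because attention cost grows with the square of the token count, shrinking 196 tokens to 76 in this toy example cuts the attention FLOPs by roughly 6.6x, which is the kind of runtime speed/accuracy knob the abstract describes.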