Given the significant advances in machine learning on mobile devices, particularly in computer vision, we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we build a dataset of measured latencies for 1,000 synthetic ViTs composed of representative building blocks and state-of-the-art architectures, covering two machine learning frameworks and six mobile platforms. Using this dataset, we show that the inference latency of new ViTs can be predicted with accuracy sufficient for real-world applications.