Vision Transformers have enabled recent attention-based Deep Learning (DL) architectures to achieve remarkable results in Computer Vision (CV) tasks. However, because of their extensive computational requirements, these architectures are rarely deployed on resource-constrained platforms. Current research investigates handcrafted hybrid models that combine convolution-based and attention-based components for CV tasks such as image classification and object detection. In this paper, we propose HyT-NAS, an efficient Hardware-aware Neural Architecture Search (HW-NAS) method whose search space includes hybrid architectures and which targets vision tasks on tiny devices. HyT-NAS improves on state-of-the-art HW-NAS by enriching the search space and enhancing both the search strategy and the performance predictors. Our experiments show that HyT-NAS achieves a similar hypervolume with roughly 5x fewer training evaluations. On Visual Wake Words, our resulting architecture outperforms MLPerf MobileNetV1 by 6.3% in accuracy while using 3.5x fewer parameters.
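For readers unfamiliar with the hypervolume metric mentioned above, the following is a minimal, generic sketch of how it can be computed for two objectives. It is not the HyT-NAS implementation: the function names, the assumption that both objectives are minimized (for example, error rate and parameter count), and the choice of reference point are all illustrative assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, assuming both objectives are minimized.

    The result is sorted by the first objective in ascending order, so the
    second objective is strictly decreasing along the returned list.
    """
    front: List[Point] = []
    best_second = float("inf")
    for p in sorted(set(points)):
        # A point is kept only if it improves on the best second objective
        # seen so far among points with a smaller-or-equal first objective.
        if p[1] < best_second:
            front.append(p)
            best_second = p[1]
    return front


def hypervolume_2d(points: List[Point], reference: Point) -> float:
    """Area dominated by the Pareto front and bounded by the reference point.

    The reference point is a hypothetical worst-case corner; points that do
    not strictly improve on it in both objectives contribute nothing.
    """
    front = [
        p for p in pareto_front(points)
        if p[0] < reference[0] and p[1] < reference[1]
    ]
    volume = 0.0
    prev_second = reference[1]
    for first, second in front:
        # Each front point adds a rectangle not covered by earlier points.
        volume += (reference[0] - first) * (prev_second - second)
        prev_second = second
    return volume


if __name__ == "__main__":
    # Hypothetical (objective_1, objective_2) pairs, both to be minimized.
    candidates = [(1.0, 3.0), (2.0, 1.0), (3.0, 3.5)]
    print(hypervolume_2d(candidates, reference=(4.0, 4.0)))
```

A larger hypervolume indicates a front that dominates more of the objective space, which is why two search strategies can be compared by the hypervolume they reach for a given number of training evaluations.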