Efficient deployment of neural networks (NN) requires the co-optimization of accuracy and latency. For example, hardware-aware neural architecture search has been used to automatically find NN architectures that satisfy a latency constraint on a specific hardware device. Central to these search algorithms is a prediction model that provides a hardware latency estimate for a candidate NN architecture. Recent research has shown that the sample efficiency of these predictive models can be greatly improved by pre-training on a set of \textit{training} devices with many samples, and then transferring the predictor to the \textit{test} (target) device. Transfer learning and meta-learning methods have been used for this, but often exhibit significant performance variability. Additionally, the evaluation of existing latency predictors has largely been done on hand-crafted training/test device sets, making it difficult to ascertain design features that compose a robust and general latency predictor. To address these issues, we introduce a comprehensive suite of latency prediction tasks obtained in a principled way through automated partitioning of hardware device sets. We then design a general latency predictor to comprehensively study (1) the predictor architecture, (2) NN sample selection methods, (3) hardware device representations, and (4) NN operation encoding schemes. Building on conclusions from our study, we present an end-to-end latency predictor training strategy that outperforms existing methods on 11 out of 12 difficult latency prediction tasks, improving latency prediction by 22.5\% on average, and up to 87.6\% on the hardest tasks. When applied to latency prediction within HW-Aware NAS, our approach delivers a $5.8\times$ speedup in wall-clock time. Our code is available on \href{https://github.com/abdelfattah-lab/nasflat_latency}{https://github.com/abdelfattah-lab/nasflat\_latency}.