Accelerating machine learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally, autotuning requires executing the workloads on the target hardware (HW). We present an interface that allows autotuning workloads to be executed on simulators instead. This approach offers high scalability when availability of the target HW is limited, as many simulations can run in parallel on any accessible HW. Additionally, we evaluate the feasibility of using fast instruction-accurate simulators for autotuning: we train various predictors to forecast the performance of ML workload implementations on the target HW from simulation statistics. Our results demonstrate that the tuned predictors are highly effective. For the tested x86, ARM, and RISC-V-based architectures, the implementation with the best actual run time on the target HW is always within the top 3% of predictions. In the best case, this approach outperforms native execution on the target HW for embedded architectures while running as few as three samples on three simulators in parallel.
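The predictor-based ranking described in the abstract can be illustrated with a minimal sketch. All details here are illustrative assumptions, not the paper's actual predictors or feature set: we assume per-candidate simulation statistics (e.g., instruction and memory-access counts), fit a simple linear regressor against measured run times, and keep the top 3% of predictions for native execution on the target HW.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation statistics per candidate implementation:
# columns = instruction count, memory accesses, branches (illustrative).
n_candidates = 200
stats = rng.uniform(1.0, 100.0, size=(n_candidates, 3))

# Synthetic "measured" run times on the target HW: a noisy linear
# function of the statistics, standing in for real measurements.
true_weights = np.array([0.5, 1.2, 0.3])
runtime = stats @ true_weights + rng.normal(0.0, 1.0, n_candidates)

# Fit a simple linear predictor on a training split via least squares.
train, test = slice(0, 150), slice(150, None)
w, *_ = np.linalg.lstsq(stats[train], runtime[train], rcond=None)

# Rank held-out candidates by predicted run time and keep the top 3%
# as the shortlist to actually execute on the target HW.
pred = stats[test] @ w
k = max(1, int(np.ceil(0.03 * pred.size)))
top_k = np.argsort(pred)[:k]
print("candidates shortlisted for native runs:", top_k)
```

In the paper's setting the shortlist would then be measured natively, so only a small fraction of candidates ever needs access to the scarce target HW; the bulk of the search runs on simulators in parallel.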