Machine-learning models consist of kernels, which are algorithms that apply operations to tensors -- data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vector products. There are many ways to implement a kernel; these implementations form the kernel's optimization space. Kernel scheduling is the problem of finding the best implementation according to an objective function -- typically execution speed. Kernel optimizers such as Ansor, Halide, and AutoTVM solve this problem via search heuristics that combine two phases: exploration and exploitation. The former evaluates many different kernel optimization spaces; the latter tries to improve the best implementations by investigating kernels within the same space. For example, Ansor combines kernel generation through sketches for exploration and leverages an evolutionary algorithm to exploit the best sketches. In this work, we demonstrate that Ansor's search time can be reduced, and kernel quality enhanced, by incorporating Droplet Search, an AutoTVM algorithm, into Ansor's exploration phase. The approach consists of limiting the number of samples explored by Ansor, selecting the best one, and exploiting it with a coordinate descent algorithm. By applying this approach to the first 300 kernels that Ansor generates, we usually obtain better kernels in less time than if Ansor were allowed to analyze 10,000 kernels. This result has been replicated across 20 well-known deep-learning models (AlexNet, ResNet, VGG, DenseNet, etc.) running on four architectures: an AMD Ryzen 7 (x86), an NVIDIA A100 tensor core, an NVIDIA RTX 3080 GPU, and an ARM A64FX. A patch with this combined approach was approved in Ansor in February 2024. As evidence of the generality of this search methodology, a similar patch, achieving equally good results, was submitted to TVM's MetaSchedule in June 2024.
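The exploitation step described above can be illustrated with a minimal sketch of coordinate descent over a discrete parameter space. This is not the actual Droplet Search implementation: the real algorithm searches over TVM scheduling parameters and measures kernel execution time on hardware, whereas the names below (`coordinate_descent`, `toy_cost`) and the convex toy objective are purely illustrative.

```python
def coordinate_descent(start, cost, lo=0, hi=64):
    """Greedily improve one coordinate at a time until no single-step
    move along any axis reduces the cost (a local minimum)."""
    best = list(start)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            for step in (-1, 1):
                cand = list(best)
                # Keep each parameter inside its legal range.
                cand[i] = min(hi, max(lo, cand[i] + step))
                c = cost(cand)
                if c < best_cost:
                    best, best_cost = cand, c
                    improved = True
    return best, best_cost

# Toy stand-in for measured kernel execution time: a convex bowl whose
# minimum at (8, 16) mimics an optimal tiling configuration.
toy_cost = lambda p: (p[0] - 8) ** 2 + (p[1] - 16) ** 2

best, t = coordinate_descent([1, 1], toy_cost)
print(best, t)
```

In the combined approach, the starting point of this descent would be the best of the limited set of samples drawn during Ansor's exploration phase, rather than an arbitrary configuration.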