Machine-learning models consist of kernels: algorithms that apply operations to tensors -- data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vector products. There are many ways to implement a kernel, and these implementations form the kernel's optimization space. Kernel scheduling is the problem of finding the best implementation, given an objective function -- typically execution speed. Kernel optimizers such as Ansor, Halide, and AutoTVM solve this problem via search heuristics that combine two phases: exploration and exploitation. The former evaluates many different kernel optimization spaces; the latter tries to improve the best implementations by further investigating kernels within the same space. For example, Ansor combines kernel generation through sketches for exploration and leverages an evolutionary algorithm to exploit the best sketches. In this work, we demonstrate the potential to reduce Ansor's search time while enhancing kernel quality by incorporating Droplet Search, an AutoTVM algorithm, into Ansor's exploration phase. The approach involves limiting the number of samples explored by Ansor, selecting the best one, and exploiting it with a coordinate descent algorithm. By applying this approach to the first 300 kernels that Ansor generates, we usually obtain better kernels in less time than if we let Ansor analyze 10,000 kernels. This result has been replicated on 20 well-known deep-learning models (AlexNet, ResNet, VGG, DenseNet, etc.) running on four architectures: an AMD Ryzen 7 (x86), an NVIDIA A100 tensor-core GPU, an NVIDIA RTX 3080 GPU, and an ARM A64FX. A patch with this combined approach was approved in Ansor in February 2024. As evidence of the generality of this search methodology, a similar patch, achieving equally good results, was submitted to TVM's MetaSchedule in June 2024.
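To make the exploitation phase concrete, the sketch below illustrates coordinate descent over a discrete kernel-configuration space, in the spirit of Droplet Search: starting from the best sampled configuration, it greedily improves one parameter at a time until no single-coordinate change helps. This is not Ansor's or AutoTVM's actual code; the `cost` function is a hypothetical stand-in for a measured kernel runtime, and the parameter axes (tile sizes) are invented for illustration.

```python
def cost(config):
    # Hypothetical objective standing in for measured runtime:
    # pretend the fastest configuration is (16, 8, 4).
    targets = (16, 8, 4)
    return sum((c - t) ** 2 for c, t in zip(config, targets))

def coordinate_descent(start, axes, cost):
    """Greedily improve one coordinate at a time until no neighbor is better.

    `axes[i]` lists the legal values of coordinate i (e.g. candidate tile
    sizes); `start` is the seed configuration (e.g. the best sample found
    during exploration).
    """
    best = list(start)
    best_cost = cost(best)
    improved = True
    while improved:
        improved = False
        for i, values in enumerate(axes):
            for v in values:
                candidate = best.copy()
                candidate[i] = v  # vary only coordinate i
                c = cost(candidate)
                if c < best_cost:
                    best, best_cost = candidate, c
                    improved = True
    return tuple(best), best_cost

# Example: three tiling parameters, each restricted to powers of two.
axes = [[1, 2, 4, 8, 16, 32]] * 3
best, best_cost = coordinate_descent((1, 1, 1), axes, cost)
print(best, best_cost)  # converges to (16, 8, 4) with cost 0
```

In a real tuner, `cost` would compile and time the kernel on the target hardware, so each probe is expensive; coordinate descent keeps the number of probes proportional to the number of parameters rather than the size of the whole space.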