Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times, and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes the tensor traversal order while using the ultra-fast, lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating code that is an order of magnitude faster than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, while consistently performing at the level of the hand-tuned library NumPy. Moreover, LoopTune tunes code on the order of seconds.
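As a rough illustration of what "tensor traversal order" means in the abstract above, the following sketch runs a small matrix multiplication under all six nestings of its three loops. It is not LoopTune's or LoopNest's API: the function name `matmul_with_order` and the loop labels "m", "n", "k" are hypothetical and chosen only for this example. Every ordering produces the same result; in compiled code the orderings differ in memory access pattern and therefore in speed, which is the kind of choice a tuner searches over.

```python
from itertools import permutations


def matmul_with_order(a, b, order):
    """Multiply a (m x k) by b (k x n) with the loops nested in `order`.

    `order` is a permutation of the hypothetical loop labels "m", "n", "k",
    listed from the outermost loop to the innermost one.
    """
    m, k, n = len(a), len(b), len(b[0])
    extents = {"m": m, "n": n, "k": k}
    c = [[0.0] * n for _ in range(m)]
    outer, middle, inner = order
    for x in range(extents[outer]):
        for y in range(extents[middle]):
            for z in range(extents[inner]):
                # Map the positional loop counters back to named indices.
                idx = {outer: x, middle: y, inner: z}
                i, j, p = idx["m"], idx["n"], idx["k"]
                c[i][j] += a[i][p] * b[p][j]
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]

# One result per loop ordering, keyed by a string such as "mnk" or "kmn".
results = {"".join(o): matmul_with_order(a, b, o) for o in permutations("mnk")}
for name, c in results.items():
    print(name, c)
```

The pure-Python loops here only demonstrate that reordering preserves the result; they say nothing about the speedups reported above, which depend on the generated machine code.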