Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times, and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes the tensor traversal order while using the ultra-fast lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating code that is an order of magnitude faster than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, while consistently performing at the level of the hand-tuned library NumPy. Moreover, LoopTune tunes code on the order of seconds.
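To make "tensor traversal order" concrete, the following minimal sketch enumerates the loop orders of a naive matrix multiplication and checks that every order computes the same result. It is only an illustration of the kind of search space involved, not LoopTune's or LoopNest's API: the function names `matmul_with_order` and `all_loop_orders` are hypothetical, and the sketch uses exhaustive enumeration rather than reinforcement learning.

```python
from itertools import permutations


def matmul_with_order(a, b, order):
    """Multiply a (m x k) by b (k x n), nesting the loops i, j, k in `order`.

    `order` is a 3-tuple such as ("i", "k", "j"), listed outermost first.
    Hypothetical helper for illustration; not part of LoopTune.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    extents = {"i": m, "j": n, "k": k_dim}
    c = [[0] * n for _ in range(m)]
    outer, middle, inner = order
    for x in range(extents[outer]):
        for y in range(extents[middle]):
            for z in range(extents[inner]):
                # Map the positional loop counters back to named indices.
                idx = {outer: x, middle: y, inner: z}
                i, j, k = idx["i"], idx["j"], idx["k"]
                c[i][j] += a[i][k] * b[k][j]
    return c


def all_loop_orders():
    """Return every permutation of the three loops (the toy search space)."""
    return list(permutations(("i", "j", "k")))


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    for order in all_loop_orders():
        # Every order yields the same product; only the memory access
        # pattern (and hence performance in compiled code) differs.
        print(order, matmul_with_order(a, b, order))
```

All six orders are semantically equivalent, so a tuner is free to choose among them by measured performance. In compiled code the choice changes the memory access pattern and therefore cache behavior; timings of this pure-Python sketch would not be representative of that effect.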