A central challenge in machine learning compilers such as XLA is multi-pass optimization and analysis. Recent work has focused chiefly on target-dependent XLA optimization at the graph, subgraph, and kernel levels. We instead focus on target-independent optimization, specifically XLA HLO pass ordering: our approach seeks the best sequence of compiler optimization passes and is decoupled from target-dependent optimization. However, there has been little domain-specific study of pass ordering for XLA HLO. To this end, we propose a deep Reinforcement Learning (RL) based search for XLA HLO pass orderings. We also propose enhancements to the deep RL algorithms that further improve search performance and open a research direction toward domain-specific guidance for RL. We build XLA Gym, an experimentation framework that lets RL algorithms interact with the compiler by applying optimization passes, and thereby train agents. In our experiments, the proposed approach achieves an average $13.3\%$ improvement in operation count reduction on a benchmark of GPT-2 training graphs, and $10.4\%$ on a diverse benchmark that includes GPT-2, BERT, and ResNet graphs, relative to the compiler's default phase ordering.
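To make the pass-ordering setup concrete, the following is a minimal sketch of a Gym-style environment in which each action applies one pass and the reward is the resulting drop in operation count. Everything in it is an assumption made for illustration: the class `ToyPassOrderingEnv`, the three toy passes (`fold_identity`, `remove_copy`, `dedupe`), and the toy graph are hypothetical and are not XLA HLO passes or the XLA Gym API. An exhaustive search over short sequences stands in for the deep RL agent so that the example stays deterministic.

```python
from itertools import product

# Toy "passes" over a flat list of op names. These are hypothetical
# stand-ins for compiler passes, not real XLA HLO passes.
def fold_identity(ops):
    # Rewrite an add-with-zero ("add0") into a plain copy.
    return ["copy" if op == "add0" else op for op in ops]

def remove_copy(ops):
    # Drop every copy op.
    return [op for op in ops if op != "copy"]

def dedupe(ops):
    # Collapse runs of adjacent identical ops into a single op.
    out = []
    for op in ops:
        if not out or out[-1] != op:
            out.append(op)
    return out

PASSES = {
    "fold_identity": fold_identity,
    "remove_copy": remove_copy,
    "dedupe": dedupe,
}

# Hypothetical input graph and a hypothetical "default" pass order.
GRAPH = ["dot", "add0", "dot", "copy", "relu", "relu"]
DEFAULT_ORDER = ("dedupe", "remove_copy", "fold_identity")

class ToyPassOrderingEnv:
    """Gym-style interface: reset() and step(action)."""

    def __init__(self, graph, max_steps=3):
        self.initial = list(graph)
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.ops = list(self.initial)
        self.steps = 0
        return len(self.ops)  # observation: current op count

    def step(self, action):
        before = len(self.ops)
        self.ops = PASSES[action](self.ops)
        self.steps += 1
        reward = before - len(self.ops)  # ops removed by this pass
        done = self.steps >= self.max_steps
        return len(self.ops), reward, done, {}

def run_sequence(env, sequence):
    # Apply a pass sequence from a fresh state; return the total reward.
    env.reset()
    total = 0
    for action in sequence:
        _, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total

def best_sequence(env):
    # Exhaustive search over all sequences of length max_steps.
    # This replaces the RL agent in this sketch.
    candidates = product(PASSES, repeat=env.max_steps)
    return max(candidates, key=lambda seq: run_sequence(env, seq))

if __name__ == "__main__":
    env = ToyPassOrderingEnv(GRAPH)
    print("default order removes", run_sequence(env, DEFAULT_ORDER), "ops")
    best = best_sequence(env)
    print("best order", best, "removes", run_sequence(env, best), "ops")
```

In this toy setting the order matters because `remove_copy` only helps after `fold_identity` has exposed the extra copy, and `dedupe` only merges the two `dot` ops once the ops between them are gone. A learned policy would replace `best_sequence` when the space of sequences is too large to enumerate.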