REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving

While model serving has unlocked unprecedented capabilities, the high cost of serving large-scale models remains a significant barrier to widespread accessibility and rapid innovation. Compiler optimizations have long driven substantial performance improvements, but existing compilers struggle with neural workloads because the space of possible transformations is exponentially large and highly interdependent. Existing stochastic search techniques can be effective, but they are often sample-inefficient and fail to leverage the structural context underlying compilation decisions. We investigate whether reasoning with large language models (LLMs), without any retraining, can exploit the context-aware decision space of compiler optimizations to significantly improve sample efficiency. To that end, we introduce a novel compilation framework, REASONING COMPILER, that formulates optimization as a sequential, context-aware decision process guided by an LLM and structured Monte Carlo tree search (MCTS). The LLM acts as a proposal mechanism, suggesting hardware-informed transformations that reflect the current program state and accumulated performance feedback. MCTS incorporates the LLM-generated proposals to balance exploration and exploitation, enabling a structured, context-sensitive traversal of the expansive compiler optimization space. By achieving substantial speedups with markedly fewer samples than leading neural compilers, our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.
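
To make the interplay between the proposal mechanism and the tree search concrete, the following is a minimal sketch of MCTS over transformation sequences, not the framework's actual implementation. Everything in it is an assumption made for illustration: the action set `TRANSFORMS`, the cost model `toy_cost` (a stand-in for hardware measurements), and `propose`, a deterministic placeholder for the LLM call that a real system would replace with a prompt containing the program state and the performance feedback gathered so far.

```python
import math

# Hypothetical action set; a real compiler exposes far more transformations.
TRANSFORMS = ("tile", "vectorize", "unroll", "parallelize")


def toy_cost(schedule):
    """Stand-in for a hardware measurement (lower is better)."""
    cost = 100.0
    if "tile" in schedule:
        cost *= 0.6
    if "parallelize" in schedule:
        cost *= 0.5
    if "unroll" in schedule:
        cost *= 0.9
    # Interdependence: vectorizing only helps after the loop nest is tiled.
    if "tile" in schedule and "vectorize" in schedule:
        if schedule.index("tile") < schedule.index("vectorize"):
            cost *= 0.5
    return cost


def propose(schedule, history, k=2):
    """Placeholder for the LLM proposal step (assumption, not a real API).

    Returns up to k transformations not yet applied. `history` holds the
    (schedule, cost) feedback a real prompt would include; it is unused here.
    """
    return [t for t in TRANSFORMS if t not in schedule][:k]


class Node:
    def __init__(self, schedule, parent=None):
        self.schedule = schedule
        self.parent = parent
        self.children = []
        self.untried = None  # filled lazily by propose()
        self.visits = 0
        self.total_reward = 0.0


def uct(child, parent_visits, c=1.4):
    """Upper confidence bound: mean reward plus an exploration bonus."""
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(budget=50):
    baseline = toy_cost(())
    root = Node(())
    history = []
    best = ((), baseline)
    for _ in range(budget):
        # Selection: descend while the node is fully expanded.
        node = root
        while True:
            if node.untried is None:
                node.untried = propose(node.schedule, history)
            if node.untried or not node.children:
                break
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
        # Expansion: apply one proposed transformation.
        if node.untried:
            action = node.untried.pop(0)
            child = Node(node.schedule + (action,), parent=node)
            node.children.append(child)
            node = child
        # Evaluation: one sample of the (toy) cost model.
        cost = toy_cost(node.schedule)
        history.append((node.schedule, cost))
        if cost < best[1]:
            best = (node.schedule, cost)
        reward = 1.0 - cost / baseline  # in [0, 1); higher is better
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return best
```

Calling `search(budget=50)` returns the best `(schedule, cost)` pair found within the sample budget. In this sketch each evaluation counts as one sample, so the budget is the quantity that sample efficiency refers to; restricting expansion to the proposer's suggestions is what lets a context-aware proposer steer the search toward promising regions.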
