SemOpt: LLM-Driven Code Optimization via Rule-Based Analysis

Automated code optimization aims to improve program performance by refactoring code, and recent studies focus on using LLMs for this task. Typical existing approaches mine optimization commits from open-source codebases to construct a large-scale knowledge base, then use information retrieval techniques such as BM25 to retrieve relevant optimization examples for hotspot code locations, which guide the LLM in optimizing those hotspots. However, because semantically equivalent optimizations can appear in syntactically dissimilar code snippets, current retrieval methods often fail to identify pertinent examples, which significantly reduces the effectiveness of these approaches. To address this limitation, we propose SemOpt, a novel framework that leverages static program analysis to precisely identify optimizable code segments, retrieve the corresponding optimization strategies, and generate the optimized results. SemOpt consists of three key components: (1) a strategy library builder that extracts and clusters optimization strategies from real-world code modifications; (2) a rule generator that produces Semgrep static analysis rules capturing the conditions under which each optimization strategy applies; and (3) an optimizer that uses the strategy library to generate optimized code. All three components are powered by LLMs. On our benchmark of 151 optimization tasks, SemOpt proves effective across different LLMs, increasing the number of successful optimizations by 1.38 to 28 times compared to the baseline. Moreover, on popular large-scale C/C++ projects, it improves individual performance metrics by 5.04% to 218.07%, demonstrating its practical utility.
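
The three-component flow described above can be illustrated with a minimal sketch. It is not SemOpt's implementation: a regular expression stands in for a Semgrep rule, the single library entry (`hoist-strlen`) is a hypothetical strategy invented for illustration, and the optimizer step only builds the prompt text that an LLM would receive rather than calling a model. All names (`Strategy`, `LIBRARY`, `match_strategies`, `build_prompt`) are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    description: str
    rule: re.Pattern  # regex stand-in for a Semgrep rule (assumption)


# (1) Strategy library: one hypothetical entry. In the abstract, entries are
# extracted and clustered from real-world code modifications by an LLM.
LIBRARY = [
    Strategy(
        name="hoist-strlen",
        description="Compute strlen once before the loop instead of "
                    "re-evaluating it in the loop condition.",
        # Matches a for-loop whose *condition* clause calls strlen(...).
        rule=re.compile(r"for\s*\([^;]*;[^;]*\bstrlen\s*\([^;]*;"),
    ),
]


def match_strategies(code: str, library: list) -> list:
    """(2) Return the strategies whose rule matches the code snippet."""
    return [s for s in library if s.rule.search(code)]


def build_prompt(code: str, strategy: Strategy) -> str:
    """(3) Build the text an LLM-based optimizer could be given (no LLM call here)."""
    return (
        f"Apply this optimization strategy: {strategy.description}\n\n"
        f"Code:\n{code}"
    )


# A hotspot that re-evaluates strlen on every iteration, and an already-hoisted variant.
hot = "for (size_t i = 0; i < strlen(s); i++) { sum += s[i]; }"
cold = "size_t n = strlen(s); for (size_t i = 0; i < n; i++) { sum += s[i]; }"

matched = match_strategies(hot, LIBRARY)
unmatched = match_strategies(cold, LIBRARY)
prompt = build_prompt(hot, matched[0])
```

Here the rule fires on `hot` but not on `cold`, so the strategy is selected by a structural condition on the code rather than by textual similarity to a stored example, which is the contrast with BM25-style retrieval that the abstract draws.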
