Superoptimization is the task of transforming a program into a faster one while preserving its input-output behavior. In this work, we investigate whether large language models (LLMs) can serve as superoptimizers, generating assembly programs that outperform code already optimized by industry-standard compilers. We construct the first large-scale benchmark for this problem, consisting of 8,072 assembly programs averaging 130 lines, in contrast to prior datasets restricted to straight-line, loop-free programs of 2-15 lines. We evaluate 23 LLMs on this benchmark and find that the strongest baseline, Claude-opus-4, achieves a 51.5% test-passing rate and a 1.43x average speedup over gcc -O3. To further enhance performance, we fine-tune models with reinforcement learning, optimizing a reward function that integrates correctness and performance speedup. Starting from Qwen2.5-Coder-7B-Instruct (61.4% correctness, 1.10x speedup), the fine-tuned model SuperCoder attains 95.0% correctness and a 1.46x average speedup, with additional improvement enabled by Best-of-N sampling and iterative refinement. Our results demonstrate, for the first time, that LLMs can be applied as superoptimizers for assembly programs, establishing a foundation for future research in program performance optimization beyond compiler heuristics.
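The abstract does not specify the exact form of the reward function; a minimal sketch, assuming a correctness-gated formulation in which only fully correct programs are rewarded by their speedup over the baseline (all names and the gating choice are illustrative assumptions, not the paper's actual definition), might look like:

```python
def reward(tests_passed: int, tests_total: int,
           t_baseline: float, t_candidate: float) -> float:
    """Hypothetical reward integrating correctness and speedup.

    Assumes a gated design: a candidate that fails any test earns zero,
    and a fully correct candidate earns its wall-clock speedup over the
    gcc -O3 baseline (e.g. 1.43 for a 1.43x speedup).
    """
    if tests_passed < tests_total:
        return 0.0  # incorrect programs receive no reward
    return t_baseline / t_candidate  # speedup ratio over the baseline


# Usage: a correct candidate that halves the baseline runtime
print(reward(10, 10, t_baseline=2.0, t_candidate=1.0))  # → 2.0
# An incorrect candidate, however fast, earns nothing
print(reward(9, 10, t_baseline=2.0, t_candidate=0.5))  # → 0.0
```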