Large Language Models (LLMs) have demonstrated great potential in a variety of language processing tasks, and recent studies have explored their application to compiler optimizations. However, these studies have focused on conventional open-source LLMs, such as Llama2, which lack enhanced reasoning mechanisms. In this study, we investigate the errors produced by a fine-tuned 7B-parameter Llama2 model as it attempts to learn and apply a simple peephole optimization for AArch64 assembly code. We analyze the errors produced by the LLM and compare them with state-of-the-art OpenAI models that implement advanced reasoning logic, namely GPT-4o and GPT-o1 (preview). We demonstrate that OpenAI GPT-o1, despite not being fine-tuned, outperforms both the fine-tuned Llama2 and GPT-4o. Our findings indicate that this advantage is largely due to the chain-of-thought reasoning implemented in GPT-o1. We hope our work will inspire further research on using LLMs with enhanced reasoning mechanisms and chain-of-thought prompting for code generation and optimization.
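To make the task concrete: the abstract does not state which peephole optimization was studied, so the following is only a hypothetical sketch of the general idea, written in Python. A peephole pass scans a short window of instructions and rewrites locally redundant patterns; here, two classic AArch64 no-ops are removed, a register self-move (`mov x0, x0`) and an in-place addition of zero (`add x0, x0, #0`).

```python
# Hypothetical sketch of a peephole optimization pass over AArch64
# assembly text (the specific optimization used in the study is not
# named in the abstract; this merely illustrates the technique).
import re

def peephole(instructions):
    """Return the instruction list with trivially redundant ops removed."""
    out = []
    for inst in instructions:
        # `mov xN, xN` copies a register to itself: a no-op.
        m = re.fullmatch(r"mov\s+(\w+),\s*(\w+)", inst.strip())
        if m and m.group(1) == m.group(2):
            continue
        # `add xN, xN, #0` adds zero in place: also a no-op.
        m = re.fullmatch(r"add\s+(\w+),\s*(\w+),\s*#0", inst.strip())
        if m and m.group(1) == m.group(2):
            continue
        out.append(inst)
    return out

code = [
    "mov x0, x0",      # redundant self-move, dropped
    "add x1, x1, #0",  # redundant add of zero, dropped
    "add x0, x1, x2",  # real work, kept
]
print(peephole(code))  # ['add x0, x1, x2']
```

An LLM attempting this task must learn exactly this kind of local rewrite rule and apply it without corrupting the surrounding instructions, which is where the error analysis in the study comes in.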