This work investigates the performance of Large Language Models (LLMs) in generating ABAP code. Although generative AI has been applied successfully to many programming languages, systematic analyses of ABAP code generation remain scarce. The aim of the study is to analyze empirically to what extent various LLMs can generate syntactically correct and functional ABAP code, how effectively they use compiler feedback for iterative improvement, and which task types pose particular challenges. To this end, a benchmark of 180 tasks is evaluated, consisting of adapted HumanEval tasks and practical SAP scenarios. The results show substantial performance differences between the models: more capable LLMs reach success rates of around 75% after several iterations and benefit greatly from compiler feedback, while smaller models perform considerably worse. Overall, the study highlights the strong potential of powerful LLMs for ABAP development processes, especially for iterative error correction.
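To make the iterative setup concrete, the following is a minimal Python sketch of a generate-compile-repair loop of the kind the abstract describes. The `llm` and `compile_abap` callables and the iteration budget are illustrative assumptions, not the paper's actual benchmark harness.

```python
MAX_ITERATIONS = 5  # assumed budget; the study reports gains "after several iterations"

def generate_with_feedback(task: str, llm, compile_abap):
    """Iteratively request ABAP code from an LLM, feeding compiler errors back.

    Hypothetical callables (not from the paper):
      llm(prompt) -> str                 # returns ABAP source code
      compile_abap(source) -> list[str]  # returns compiler errors, [] on success
    """
    prompt = f"Write ABAP code for the following task:\n{task}"
    for _ in range(MAX_ITERATIONS):
        source = llm(prompt)
        errors = compile_abap(source)
        if not errors:
            return source  # syntactically correct; functional tests would run next
        # Append the compiler diagnostics so the next attempt can repair them.
        error_text = "\n".join(errors)
        prompt = (
            "The previous ABAP code failed to compile:\n"
            f"{source}\n"
            "Compiler errors:\n"
            f"{error_text}\n"
            "Please return a corrected version."
        )
    return None  # give up once the iteration budget is exhausted
```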