The use of large language models (LLMs) for source code has recently gained attention. Transformer-based LLMs such as Codex and ChatGPT have been shown to be highly capable of solving a wide range of programming problems. However, it remains unclear to what extent LLMs understand problem descriptions and generate programs accordingly, rather than retrieving source code from the most relevant problem in the training data based on superficial cues. To explore this research question, we conduct experiments on the robustness of several popular LLMs capable of code generation, namely CodeGen and the GPT-3.5 series models, on introductory programming problems. Our experimental results show that CodeGen and Codex are sensitive to superficial modifications of problem descriptions, which significantly impact their code generation performance. Furthermore, we observe that Codex relies on variable names, as randomizing variable names decreases the solved rate significantly. However, state-of-the-art (SOTA) models such as InstructGPT and ChatGPT show higher robustness to superficial modifications and have an outstanding capability for solving programming problems. These findings highlight that slight modifications to the prompts given to LLMs can greatly affect code generation performance, so careful prompt formatting is essential for high-quality code generation, while the SOTA models are becoming more robust to perturbations.