It is now common practice in software development to use large language models (LLMs) to generate program code, and it is therefore desirable to evaluate the robustness of LLMs for this usage. This paper is concerned in particular with how sensitive LLMs are to variations in the natural language descriptions of coding tasks. However, existing robustness evaluation techniques are unsuitable for code generation because the input space of natural language descriptions is discrete. To address this problem, we propose a robustness evaluation method called scenario domain analysis, which aims to find the expected minimal change in the natural language description of a coding task that would cause an LLM to produce incorrect output. We formally prove the theoretical properties of the method and conduct extensive experiments to evaluate the robustness of four state-of-the-art LLMs: Gemini Pro, Codex, Llama 2 and Falcon 7B, finding that we are able to rank them with confidence from best to worst. Moreover, we study how robustness varies across scenarios, including variation with the topic of the coding task and with the complexity of its sample solution, and find that robustness is lower for more complex tasks and for more advanced topics, such as multi-threading and data structures.