This paper reports an evaluation of ChatGPT's ability to generate R code from natural language input. A dataset designed specifically for R code generation was constructed, with metadata that supports scenario-based testing and evaluation across usage scenarios of different difficulty levels and different types of programs. The evaluation follows a multiple-attempt process: the tester tries to complete each code generation task over a number of attempts, stopping when a satisfactory solution is obtained or giving up after a fixed maximum number of attempts. In each attempt, the tester formulates a natural language input to ChatGPT based on the previous results and the task to be completed. In addition to the average number of attempts and the average time taken to complete the tasks, the final generated solutions are assessed on a number of quality attributes: accuracy, completeness, conciseness, readability, well-structuredness, logic clarity, depth of explanation, and coverage of parameters. Our experiments demonstrate that ChatGPT is in general highly capable of generating high-quality R code as well as textual explanations, although it may fail on hard programming tasks. The experiment data also show that human developers can hardly learn naturally from experience to improve their skill in using ChatGPT to generate code.