The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of the generated code need to be systematically evaluated before it can be widely adopted. Despite their growing prominence, comprehensive evaluations of statistical code generated by LLMs remain scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study uses a set of statistical analysis tasks spanning diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. Through human expert evaluation, we conduct a comprehensive assessment of the quality of the SAS code generated by LLMs, scoring each submission on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs are useful for generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers valuable insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.