With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as intelligent tools for generating program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. It is desirable to evaluate their usability when deciding whether to use an LLM in software production. This paper proposes a user-centric method for this purpose. The method includes metadata in the test cases of a benchmark to describe their usage, conducts testing in a multi-attempt process that mimics how LLMs are used, measures LLM-generated solutions on a set of quality attributes that reflect usability, and evaluates performance based on the user's experience of using the LLM as a tool. The paper also reports a case study that applies the method to evaluate ChatGPT's usability as a code generation tool for the R programming language. Our experiments demonstrate that ChatGPT is highly useful for generating R program code, although it may fail on hard programming tasks. The user experience is good, with an overall average of 1.61 attempts and an average completion time of 47.02 seconds. Our experiments also found that the weakest aspect of usability is conciseness, which has a score of 3.80 out of 5.