User Centric Evaluation of Code Generation Tools

From arXiv. The paper has been accepted as an invited paper by IEEE AITest 2024 at the IEEE CISOSE 2024 Congress and will appear in the AITest 2024 Conference Proceedings.

With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as intelligent tools for generating program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. When deciding whether to use an LLM in software production, it is desirable to evaluate its usability. This paper proposes a user-centric method for this purpose. The method includes metadata in the test cases of a benchmark to describe their usages, conducts testing in a multi-attempt process that mimics how LLMs are used, measures LLM-generated solutions on a set of quality attributes that reflect usability, and evaluates performance based on the user's experience of using the LLM as a tool. The paper also reports a case study that applies the method to evaluate ChatGPT's usability as a code generation tool for the R programming language. Our experiments demonstrated that ChatGPT is highly useful for generating R program code, although it may fail on hard programming tasks. The user experience is good, with an overall average of 1.61 attempts and an average completion time of 47.02 seconds. Our experiments also found that the weakest aspect of usability is conciseness, which has a score of 3.80 out of 5.
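To make the user-experience measures concrete, the following is a minimal Python sketch of how per-task records from a multi-attempt test run might be aggregated into an average number of attempts and an average completion time. It is not the paper's implementation: the `TaskResult` record, its field names (including the `difficulty` metadata field), and the choice to average over all tasks rather than only the solved ones are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical record layout)."""
    task_id: str
    difficulty: str   # assumed metadata field, e.g. "easy" or "hard"
    attempts: int     # attempts used before success or giving up
    seconds: float    # total time spent on the task
    solved: bool      # whether an acceptable solution was obtained


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate user-experience measures over all tasks.

    Assumption: averages are taken over every task, solved or not.
    """
    if not results:
        raise ValueError("no results to summarize")
    return {
        "success_rate": sum(r.solved for r in results) / len(results),
        "avg_attempts": mean(r.attempts for r in results),
        "avg_seconds": mean(r.seconds for r in results),
    }


# Made-up sample data, not taken from the paper.
sample = [
    TaskResult("t1", "easy", attempts=1, seconds=30.0, solved=True),
    TaskResult("t2", "easy", attempts=2, seconds=60.0, solved=True),
    TaskResult("t3", "hard", attempts=3, seconds=90.0, solved=False),
]
summary = summarize(sample)
```

Grouping the same records by the `difficulty` field before calling `summarize` would give per-category figures, which is one way the test-case metadata could be put to use.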

