Code LLMs are being rapidly deployed, and there is evidence that they can make professional programmers more productive. Current benchmarks for code generation measure whether models generate correct programs given an expert prompt. In this paper, we present a new benchmark containing multiple prompts per problem, written by a specific population of non-expert prompters: beginning programmers. StudentEval contains 1,749 prompts for 48 problems, written by 80 students who have completed only one semester of Python programming. Our students wrote these prompts while working interactively with a Code LLM, and we observed very mixed success rates. We use StudentEval to evaluate 5 Code LLMs and find that it is a better discriminator of model performance than existing benchmarks. We analyze the prompts and find significant variation in students' prompting techniques. We also find that nondeterministic LLM sampling could mislead students into thinking that their prompts are more (or less) effective than they actually are, which has implications for how to teach with Code LLMs.
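The point about nondeterministic sampling can be illustrated with a little arithmetic. The sketch below is not taken from the paper's methodology; it assumes, purely for illustration, that each completion sampled for a given prompt is correct independently with some fixed probability `p`. The function name `outcome_probabilities` and the parameter `attempts` are hypothetical. It shows how likely a student is to see only failures, only successes, or a mix when resubmitting the same prompt a few times.

```python
def outcome_probabilities(p: float, attempts: int = 1) -> dict:
    """Probabilities of what a student observes when sampling one prompt repeatedly.

    Assumption (illustrative only): each sampled completion is correct
    independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    all_fail = (1.0 - p) ** attempts  # every sampled completion is wrong
    all_pass = p ** attempts          # every sampled completion is right
    mixed = 1.0 - all_fail - all_pass # at least one of each
    return {"all_fail": all_fail, "all_pass": all_pass, "mixed": mixed}


if __name__ == "__main__":
    # A prompt that works half the time looks fully reliable or fully broken
    # to a student who tries it only once.
    print(outcome_probabilities(0.5, attempts=1))
    # A weak prompt can still look fine on a lucky single sample.
    print(outcome_probabilities(0.2, attempts=1))
```

Under this assumption, a single attempt never reveals that a prompt is unreliable: the student sees either a pass or a fail, and only repeated sampling exposes the mixed behavior.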