Large Language Models (LLMs) are changing the way learners acquire knowledge outside the classroom. Previous studies have shown that LLMs can effectively generate solutions to short and simple questions in introductory CS courses that use high-resource programming languages such as Java or Python. In this paper, we evaluate the effectiveness of LLMs on a low-resource programming language, OCaml, in an educational setting. In particular, we built three benchmarks to comprehensively evaluate 9 state-of-the-art LLMs: 1) $λ$CodeGen (a benchmark containing natural-language homework programming problems); 2) $λ$Repair (a benchmark containing programs with syntax, type, and logical errors drawn from actual student submissions); 3) $λ$Explain (a benchmark containing natural-language questions about theoretical programming concepts). We grade each LLM's responses for correctness using the OCaml compiler and an autograder. Our evaluation also goes beyond common evaluation methodology by using manual grading to assess the quality of the responses. Our study shows that the top three LLMs are effective on all tasks within a typical functional programming course, although they solve far fewer homework problems in the low-resource setting than they do for introductory programming problems in Python and Java. The strength of LLMs lies in correcting syntax and type errors as well as generating answers to basic conceptual questions. While LLMs may not yet match dedicated language-specific tools in some areas, their convenience as a one-stop tool for multiple programming languages can outweigh the benefits of more specialized systems. We hope our benchmarks can serve multiple purposes: to assess the evolving capabilities of LLMs, to help instructors raise awareness among students about the limitations of LLM-generated solutions, and to inform programming language researchers about opportunities to integrate domain-specific reasoning into LLMs and develop more powerful code synthesis and repair tools for low-resource languages.
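To make the repair task concrete, the following is a minimal, hypothetical sketch of the kind of type error that $λ$Repair targets; this example is illustrative and is not drawn from the benchmark or from any student submission.

```ocaml
(* Hypothetical example of a λRepair-style task (not from the benchmark).
   Buggy submission: [acc + 1] forces the accumulator to be an int,
   but the initial accumulator [] is a list, so this does not type-check:

     let length lst = List.fold_left (fun acc _ -> acc + 1) [] lst
*)

(* Repaired version: the fix is to start the fold with 0. *)
let length lst = List.fold_left (fun acc _ -> acc + 1) 0 lst

let () = assert (length [1; 2; 3] = 3)
```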