Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBench

Code generation benchmarks such as HumanEval are widely adopted to evaluate the capabilities of LLMs. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, programming languages are imbalanced: 95.8% of the benchmarks involve Python, while only 5 involve Java. Second, code granularity is imbalanced: function- and statement-level benchmarks account for over 83.3% of the benchmarks, only a handful extend to the class or project level, and all of those are limited to Python. Third, advanced features are lacking: existing benchmarks primarily assess basic coding skills while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism). To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. Its test coverage is up to 92%, and JavaBench has been validated by 282 undergraduate students, who reached an average score of 90.93/100 (i.e., pass rate against the test suite), which supports the quality of the documentation, code skeletons, and tests. To better evaluate LLMs' capabilities on JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities, using three hierarchical metrics. Our extensive experiments yield several interesting findings. First, in project-level Java programming, LLMs are far behind undergraduate students: none of the studied LLMs correctly completes any project, and the best result is at most 41.17% Pass@5 under a more relaxed evaluation. Second, using method signatures as prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at https://github.com/java-bench/JavaBench.
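
The abstract reports Pass@5 without defining it. As a minimal sketch, assuming the widely used unbiased Pass@k estimator popularized alongside HumanEval (the abstract does not confirm that JavaBench computes it this way), the value for one task with n generated samples, of which c pass the tests, is 1 - C(n-c, k) / C(n, k). The class and method names below are hypothetical and serve only as illustration.

```java
// Illustrative sketch only: PassAtK and passAtK are hypothetical names,
// and the estimator is an assumption, not taken from the JavaBench paper.
final class PassAtK {

    private PassAtK() {
    }

    /**
     * Estimates the probability that at least one of k samples, drawn
     * without replacement from n generated samples, passes the tests,
     * given that c of the n samples pass.
     */
    static double passAtK(int n, int c, int k) {
        if (k < 1 || k > n || c < 0 || c > n) {
            throw new IllegalArgumentException("require 1 <= k <= n and 0 <= c <= n");
        }
        // Fewer than k failing samples: every draw of k contains a passing one.
        if (n - c < k) {
            return 1.0;
        }
        // C(n-c, k) / C(n, k) rewritten as a product to avoid large binomials.
        double allFail = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            allFail *= 1.0 - (double) k / i;
        }
        return 1.0 - allFail;
    }

    public static void main(String[] args) {
        // Example: 10 samples for one task, 3 of them pass, k = 5.
        System.out.println(passAtK(10, 3, 5));
    }
}
```

A benchmark-level score would then typically be the mean of this per-task value over all tasks; whether JavaBench aggregates per method, per class, or per project is not stated in the abstract.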
