CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

In the evolving landscape of large language models (LLMs) tailored for software engineering, benchmarks that accurately reflect real-world development scenarios are essential. Current benchmarks are either too simplistic or fail to capture the multi-task nature of software development. To address this, we introduce CoderUJB, a new benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java's prevalence in real-world software production. CoderUJB comprises 2,239 programming questions derived from 17 real open-source Java projects and spans five practical programming tasks. Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs, examining how continued pre-training on code in a specific programming language and instruction fine-tuning affect their performance. The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation (e.g., test generation and defect detection). Importantly, our results advise caution when applying language-specific continued pre-training and instruction fine-tuning, as these techniques can hinder model performance on certain tasks, suggesting the need for more nuanced strategies. CoderUJB thus marks a significant step towards more realistic evaluation of the programming capabilities of LLMs, and our study provides valuable insights for the future development of these models in software engineering.
