Recent advances in large language models (LLMs) have driven extensive evaluations in software engineering. However, most prior work concentrates on code-level tasks, leaving software design capabilities underexplored. To fill this gap, we conduct a comprehensive empirical study evaluating 29 LLMs on object-oriented design (OOD) tasks. Owing to the lack of standardized benchmarks and metrics, we introduce OODEval, a manually constructed benchmark comprising 50 OOD tasks of varying difficulty, and OODEval-Human, the first human-rated OOD benchmark, which contains 940 class diagrams submitted by undergraduates and graded by instructors. We further propose CLUE (Class Likeness Unified Evaluation), a unified metric set that assesses both global correctness and fine-grained design quality in class diagram generation. Using these benchmarks and metrics, we investigate five research questions covering overall correctness, comparison with human designers, model dimensions, task features, and bad cases. The results indicate that while LLMs achieve high syntactic accuracy, they exhibit substantial semantic deficiencies, particularly in method and relationship generation. Among the evaluated models, Qwen3-Coder-30B achieves the best overall performance, rivaling DeepSeek-R1 and GPT-4o, while Gemma3-4B-IT outperforms GPT-4o-Mini despite its smaller parameter scale. Although top-performing LLMs nearly match the average performance of undergraduates, they remain significantly below the level of the best human designers. Further analysis shows that parameter scale, code specialization, and instruction tuning strongly influence performance, whereas increased design complexity and lower requirement readability degrade it. Bad case analysis reveals common failure modes, including keyword misuse, missing classes or relationships, and omitted methods.