Large language models (LLMs) are increasingly applied in computer science education for tasks such as tutoring, content generation, and code assessment. However, systematic evaluations aligned with formal curricula and certification standards remain limited. This study benchmarked four recent models (GPT-5, DeepSeek-R1, Qwen-Plus, and Llama-3.3-70B-Instruct) on a dataset of 1,068 questions derived from six certification exams covering networking, office applications, and Java programming. We evaluated performance along five dimensions: language (Chinese vs. English), cognitive level under Bloom's Taxonomy, knowledge domain, confidence-accuracy alignment, and robustness to input masking. GPT-5 performed best on English-language certifications, while Qwen-Plus performed better in Chinese contexts. DeepSeek-R1 achieved the most balanced cross-lingual performance, whereas Llama-3.3 showed clear limitations in higher-order reasoning and robustness. All models performed worse on more complex tasks. These findings provide empirical support for integrating LLMs into computer science education and offer practical implications for curriculum design and assessment.