Programming is a core skill in computer science and software engineering (SE), yet identifying and resolving code errors remains challenging for both novice and experienced developers. While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation tasks, their potential in domain-specific, complex scenarios, such as multi-label classification (MLC) of programming errors, remains underexplored. Addressing this gap, this study proposes a multi-label error classification (MLEC) framework for source code that leverages fine-tuned LLMs, including CodeT5-base, GraphCodeBERT, CodeT5+, UniXcoder, RoBERTa, PLBART, and CoTexT. These LLMs are integrated with deep learning (DL) architectures such as GRU, LSTM, BiLSTM, and BiLSTM with an additive attention mechanism (BiLSTM-A) to capture both syntactic and semantic features from a real-world dataset of student-written Python code errors. Extensive experiments across 32 model variants, each optimized via Optuna-based hyperparameter tuning, were evaluated using comprehensive multi-label metrics, including average accuracy, macro and weighted precision, recall, F1-score, exact match accuracy, One-error, Hamming loss, Jaccard similarity, and ROC-AUC (micro, macro, and weighted). Results show that the CodeT5+\_GRU model achieved the strongest performance, with a weighted F1-score of 0.8243, average accuracy of 91.84\%, exact match accuracy of 53.78\%, Hamming loss of 0.0816, and One-error of 0.0708. These findings confirm the effectiveness of combining pretrained semantic encoders with efficient recurrent decoders. This work lays the foundation for developing intelligent, scalable tools for automated code feedback, with potential applications in programming education (PE) and broader SE domains.
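The multi-label metrics named above can be illustrated with a minimal sketch using scikit-learn on toy label matrices; the label data below is purely illustrative and is not drawn from the paper's dataset or results.

```python
# Hedged sketch: computing several multi-label metrics from the abstract
# (exact match accuracy, Hamming loss, weighted F1, Jaccard similarity)
# on toy binary label matrices. Rows are code samples, columns are error
# labels; the values here are illustrative only, not the paper's data.
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, f1_score, jaccard_score

# Toy ground-truth and predicted label matrices: 4 samples, 3 error labels
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# Exact match accuracy: fraction of samples whose entire label set is correct
exact_match = accuracy_score(y_true, y_pred)   # 2 of 4 rows match -> 0.5

# Hamming loss: fraction of individual label positions predicted wrongly
h_loss = hamming_loss(y_true, y_pred)          # 2 wrong of 12 -> ~0.1667

# Weighted F1: per-label F1 averaged with weights proportional to support
f1_w = f1_score(y_true, y_pred, average="weighted")

# Jaccard similarity: per-sample overlap of predicted vs. true label sets
jacc = jaccard_score(y_true, y_pred, average="samples")

print(exact_match, h_loss, f1_w, jacc)
```

Exact match is the strictest metric (every label in a sample must be right), while Hamming loss credits partially correct predictions, which is why the paper's exact match accuracy (53.78\%) is far lower than its average accuracy (91.84\%) under the same predictions.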