Advances in natural language processing have paved the way for automated scoring systems in many languages, including German, for which models such as German BERT (G-BERT) are available. Automatically scoring written responses to science questions in German is nevertheless a complex task, and it is challenging for standard G-BERT, which lacks contextual knowledge of the science domain and may be poorly aligned with student writing styles. This paper presents a contextualized German Science Education BERT (G-SciEdBERT), a large language model tailored for scoring German-written responses to science tasks. Starting from G-BERT, we pre-trained G-SciEdBERT on a corpus of 50K German written science responses (5M tokens) to the Programme for International Student Assessment (PISA) 2015. We then fine-tuned G-SciEdBERT on 59 assessment items, examined its scoring accuracy, and compared its performance with that of G-BERT. Our findings reveal a substantial improvement in scoring accuracy with G-SciEdBERT, which shows a 10% increase in quadratic weighted kappa compared to G-BERT (mean accuracy difference = 0.096, SD = 0.024). These results underline the value of specialized language models such as G-SciEdBERT, which are trained to improve the accuracy of automated scoring, and they offer a substantial contribution to the field of AI in education.
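For readers unfamiliar with the reported metric, quadratic weighted kappa measures agreement between two sets of ordinal scores (here, human and model scores), penalizing each disagreement by the squared distance between the two categories. The following is a minimal sketch of the standard definition, not the authors' evaluation code; the function name, the integer score encoding, and the example scores are assumptions made for illustration.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Cohen's kappa with quadratic weights.

    Scores are assumed to be integers in range(num_categories);
    this encoding is an illustrative assumption, not taken from the paper.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    if num_categories < 2:
        raise ValueError("need at least two score categories")

    n = len(rater_a)
    k = num_categories

    # Observed confusion matrix: observed[i][j] counts responses scored
    # i by the first rater and j by the second.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0

    # Marginal score histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            # Expected count if the two raters scored independently.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Degenerate case: both raters used one identical category.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores for five responses on a 0-2 scale.
human_scores = [0, 1, 2, 1, 0]
model_scores = [0, 1, 2, 2, 0]
print(quadratic_weighted_kappa(human_scores, model_scores, 3))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.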