Immediate, personalized feedback is crucial for second-language learners, and Automated Essay Scoring (AES) systems are a vital resource when human instructors are unavailable. This study investigates the effectiveness of Large Language Models (LLMs), specifically GPT-4 and fine-tuned GPT-3.5, as tools for AES. Our experiments on both public and private datasets show that LLM-based AES systems offer superior accuracy, consistency, generalizability, and interpretability, with fine-tuned GPT-3.5 surpassing traditional grading models. We also conduct LLM-assisted human evaluation experiments with both novice and expert graders. A key finding is that LLMs not only automate the grading process but also improve the performance of human graders: novice graders, when provided with LLM-generated feedback, reach accuracy on par with experts, while experts become more efficient and more consistent in their assessments. These results underscore the potential of LLMs in educational technology and pave the way for effective human-AI collaboration, ultimately enabling transformative learning experiences through AI-generated feedback.