Large language models have demonstrated strong capabilities in natural language generation, reasoning, and comprehension. This study assesses the capabilities and limitations of ChatGPT, a prominent large language model, in automated essay scoring. To that end, it constructs prompts and comments grounded in the various scoring criteria set out in the official TOEFL guide. Prevailing approaches to automated essay scoring rely on deep neural networks, statistical machine learning, and fine-tuning of pre-trained models. These approaches face challenges when applied to new contexts or subjects, primarily because they require substantial amounts of data and adapt poorly to small samples. In contrast, this study uses ChatGPT to score English essays automatically in an experiment with only a small sample. The empirical findings indicate that ChatGPT is operationally usable for automated essay scoring, although its results exhibit a regression effect. Designing and implementing effective ChatGPT prompts requires considerable domain expertise and technical proficiency, as the prompts are subject to specific threshold criteria.

Keywords: ChatGPT, Automated Essay Scoring, Prompt Learning, TOEFL Independent Writing Task