Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task~2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods; the best configuration, integrating k-SFT and RAG, achieves the strongest overall performance with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2 writing, demonstrating the promise of automated grading for writing tasks. Code is publicly available at https://github.com/MinhNguyenDS/LLM_AES-EnL2