Automated essay scoring (AES) systems increasingly rely on large language models (LLMs), yet little is known about how architectural choices shape their performance across different essay quality levels. This paper evaluates single-agent and multi-agent LLM architectures for essay grading on the ASAP 2.0 corpus. Our multi-agent system decomposes grading into three specialist agents (Content, Structure, Language) coordinated by a Chairman Agent that applies rubric-aligned logic, including veto rules and score capping. We test both architectures in zero-shot and few-shot conditions using GPT-5.1. Results show that the multi-agent system is significantly better at identifying weak essays, while the single-agent system performs better on mid-range essays; both architectures struggle with high-quality essays. Critically, few-shot calibration emerges as the dominant factor in system performance: providing just two examples per score level improves quadratic weighted kappa (QWK) by approximately 26% for both architectures. These findings suggest that architectural choice should align with specific deployment priorities, with multi-agent systems particularly suited to diagnostic screening of at-risk students, while single-agent models provide a cost-effective solution for general assessment.
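The abstract does not specify the exact veto and capping rules the Chairman Agent uses; purely as an illustration, here is a minimal sketch of how such an agent might combine the three specialist scores. All names, thresholds, and rules (`veto_cap`, `cap_margin`, the averaging step) are hypothetical assumptions, not the paper's actual method.

```python
from dataclasses import dataclass

@dataclass
class SpecialistVerdict:
    score: int          # rubric score from one specialist agent (e.g. 1-6)
    veto: bool = False  # specialist flags a disqualifying failure


def chairman_aggregate(content: SpecialistVerdict,
                       structure: SpecialistVerdict,
                       language: SpecialistVerdict,
                       veto_cap: int = 2,
                       cap_margin: int = 1) -> int:
    """Hypothetical Chairman logic: average the specialist scores,
    cap the result relative to the weakest dimension, and apply
    a hard ceiling if any specialist issues a veto."""
    verdicts = [content, structure, language]
    # Baseline: rounded mean of the three specialist scores.
    final = round(sum(v.score for v in verdicts) / len(verdicts))
    # Score capping: final score may exceed the weakest
    # dimension by at most `cap_margin` points.
    final = min(final, min(v.score for v in verdicts) + cap_margin)
    # Veto rule: any veto caps the final score at `veto_cap`.
    if any(v.veto for v in verdicts):
        final = min(final, veto_cap)
    return final
```

Under this sketch, one weak dimension drags the final score down even when the others are strong, which matches the abstract's observation that the multi-agent system is better at flagging weak essays.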