Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays

Automated Essay Scoring systems have traditionally focused on holistic scores, which limits their pedagogical usefulness, especially for complex essay genres such as argumentative writing. In educational contexts, teachers and learners require interpretable, trait-level feedback that aligns with instructional goals and established rubrics. In this paper, we study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms designed for realistic educational deployment: (1) structured in-context learning with small open-source LLMs, and (2) a supervised, encoder-based BigBird model, suited to long-sequence understanding, with a CORAL-style ordinal regression formulation. We conduct a systematic evaluation on the ASAP++ dataset, which includes essay scores across five quality traits and offers strong coverage of core argumentation dimensions. We prompt the LLMs with specifically designed, rubric-aligned in-context examples and ask them for feedback and a confidence estimate alongside each score; for the BigBird model, we explicitly model score ordinality via the rank-consistent CORAL framework. Our results show that explicitly modeling score ordinality substantially improves agreement with human raters across all traits, outperforming the LLMs as well as nominal-classification and regression-based baselines. This finding reinforces the importance of aligning model objectives with rubric semantics for educational assessment. At the same time, small open-source LLMs achieve competitive performance without task-specific fine-tuning, particularly for reasoning-oriented traits, while enabling transparent, privacy-preserving, and locally deployable assessment scenarios. Our findings provide methodological, modeling, and practical insights for the design of AI-based educational systems that aim to deliver interpretable, rubric-aligned feedback for argumentative writing.
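
To make the CORAL-style ordinal formulation concrete, the following is a minimal sketch of the general idea, not the paper's implementation. It assumes a trait score with `num_levels` ordered levels is encoded as `num_levels - 1` binary "is the score greater than k?" targets, and that a single shared scalar per essay (here a plain float standing in for an encoder output) is combined with one bias per threshold. All function names, the bias values, and the six-level scale are hypothetical and chosen only for illustration.

```python
import math


def ordinal_targets(score, num_levels):
    """Encode an integer score in [0, num_levels - 1] as binary targets,
    one per threshold k, answering: is score > k?"""
    return [1 if score > k else 0 for k in range(num_levels - 1)]


def coral_logits(shared_score, biases):
    """CORAL-style logits: one shared scalar per essay plus a separate
    bias per threshold (the shared part is what keeps ranks consistent)."""
    return [shared_score + b for b in biases]


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def coral_loss(logits, targets):
    """Sum of binary cross-entropies over the thresholds, written in the
    stable form softplus(z) - t * z."""
    return sum(softplus(z) - t * z for z, t in zip(logits, targets))


def predict_level(logits):
    """Predicted level = number of thresholds whose probability exceeds 0.5,
    i.e. whose logit is positive."""
    return sum(1 for z in logits if z > 0.0)


# Hypothetical example: six score levels, hence five thresholds.
biases = [2.0, 1.0, 0.0, -1.0, -2.0]   # illustrative, non-increasing biases
logits = coral_logits(0.5, biases)      # 0.5 stands in for an encoder output
level = predict_level(logits)
loss_near = coral_loss(logits, ordinal_targets(3, 6))
loss_far = coral_loss(logits, ordinal_targets(0, 6))
```

Under this encoding, a prediction that is off by several levels violates more threshold targets than one that is off by a single level, so the loss grows with ordinal distance; a nominal classification loss would treat both errors the same.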
