Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.
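The candidate-selection step described above can be illustrated with a minimal sketch. This is not the released FeedEval implementation: the `Scorer` interface, the `select_feedback` helper, the unweighted-mean aggregation, and the toy scoring functions are all assumptions made for illustration. In the actual framework, each dimension would be scored by a trained, dimension-specialized LLM evaluator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The three dimensions named in the abstract.
DIMENSIONS = ("specificity", "helpfulness", "validity")

# Hypothetical interface: a scorer maps (essay, feedback) to a numeric score.
# A real evaluator would wrap a trained LLM; this type is an assumption.
Scorer = Callable[[str, str], float]


@dataclass
class ScoredFeedback:
    feedback: str
    scores: Dict[str, float]

    @property
    def overall(self) -> float:
        # Assumption: dimensions are combined by an unweighted mean.
        # The abstract does not specify the aggregation rule.
        return sum(self.scores.values()) / len(self.scores)


def select_feedback(
    essay: str, candidates: List[str], evaluators: Dict[str, Scorer]
) -> ScoredFeedback:
    """Score every candidate on each dimension and return the best one."""
    if not candidates:
        raise ValueError("at least one feedback candidate is required")
    scored = [
        ScoredFeedback(fb, {dim: evaluators[dim](essay, fb) for dim in DIMENSIONS})
        for fb in candidates
    ]
    return max(scored, key=lambda s: s.overall)


# Toy placeholder scorers so the sketch runs end to end (not real evaluators).
toy_evaluators: Dict[str, Scorer] = {
    "specificity": lambda essay, fb: float(len(fb.split())),
    "helpfulness": lambda essay, fb: 1.0 if "try" in fb.lower() else 0.0,
    "validity": lambda essay, fb: 1.0,
}

candidates = ["Good essay.", "Try adding evidence to paragraph two."]
best = select_feedback("An example essay.", candidates, toy_evaluators)
```

With these placeholder scorers, `best.feedback` is the second candidate, since it scores higher on the toy specificity and helpfulness measures. Swapping in real evaluators only requires supplying functions that match the assumed `Scorer` signature.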