We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.
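The Advocate, Skeptic, and Judge roles, together with exemplar retrieval for the Judge, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` callable, the `Exemplar` type, the prompt wording, the Jaccard token-overlap retriever, and the `Score: <integer>` reply format are all hypothetical choices made for this example, since the abstract does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Exemplar:
    essay: str
    trait: str
    score: int


def _tokens(text: str) -> set:
    """Lowercased word tokens, used only for the toy similarity below."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve(essay: str, trait: str, pool: List[Exemplar], k: int = 2) -> List[Exemplar]:
    """Return up to k scored exemplars for this trait, most similar first.

    Assumption: Jaccard token overlap stands in for whatever retriever
    the paper actually uses.
    """
    query = _tokens(essay)

    def similarity(ex: Exemplar) -> float:
        toks = _tokens(ex.essay)
        union = query | toks
        return len(query & toks) / len(union) if union else 0.0

    candidates = [ex for ex in pool if ex.trait == trait]
    return sorted(candidates, key=similarity, reverse=True)[:k]


def score_trait(essay: str, trait: str, pool: List[Exemplar], llm: LLM, k: int = 2) -> int:
    """Run one Advocate -> Skeptic -> Judge pass for a single rubric trait."""
    # The Advocate argues for the essay's strengths.
    advocate = llm(f"[Advocate] List the strengths of this essay on '{trait}'.\n\n{essay}")

    # The Skeptic sees the Advocate's case and critiques weaknesses.
    skeptic = llm(
        f"[Skeptic] Critique the weaknesses of this essay on '{trait}'.\n"
        f"Advocate's case:\n{advocate}\n\nEssay:\n{essay}"
    )

    # The Judge is given scored exemplars for calibration.
    shots = "\n".join(
        f"- score {ex.score}: {ex.essay}" for ex in retrieve(essay, trait, pool, k)
    )
    verdict = llm(
        f"[Judge] Score this essay on '{trait}'.\n"
        f"Scored exemplars:\n{shots}\n\n"
        f"Advocate:\n{advocate}\n\nSkeptic:\n{skeptic}\n\nEssay:\n{essay}\n\n"
        "Reply as 'Score: <integer>'."
    )

    match = re.search(r"Score:\s*(\d+)", verdict)
    if match is None:
        raise ValueError("Judge reply did not contain a parsable score")
    return int(match.group(1))
```

Because `llm` is passed in as a plain callable, the same sketch can be exercised with a stub for testing or wired to any model client. Dropping the `shots` section from the Judge prompt, or skipping the Advocate and Skeptic calls, gives the kind of retrieval and debate ablations the abstract mentions.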