Large language models (LLMs) are increasingly evaluated, and sometimes trained, using automated graders such as LLM-as-judge systems that output scalar scores or preferences. While convenient, these approaches are often opaque: a single score rarely explains why an answer is good or bad, which requirements were missed, or how a system should be improved. This lack of interpretability limits their usefulness for model development, dataset curation, and high-stakes deployment. Query-specific rubric-based evaluation offers a more transparent alternative by decomposing quality into explicit, checkable criteria. However, manually designing high-quality, query-specific rubrics is labor-intensive and cognitively demanding, making it infeasible at deployment scale. While previous approaches have focused on generating intermediate rubrics for automated downstream evaluation, it remains unclear whether such rubrics are both interpretable to human users and effective. In this work, we investigate whether LLMs can generate useful, instance-specific rubrics that align with human-authored ones while also improving effectiveness at identifying good responses. Through a systematic study on two rubric benchmarks and across multiple few-shot and post-training strategies, we find that off-the-shelf LLMs produce rubrics that are poorly aligned with human-authored ones. We introduce a simple strategy, RubricRAG, which retrieves rubrics from related queries at inference time to supply domain knowledge. We demonstrate that RubricRAG generates more interpretable rubrics, both in terms of similarity to human-authored rubrics and in terms of improved downstream evaluation effectiveness. Our results highlight both the challenges of automated rubric generation and a promising approach toward scalable, interpretable evaluation.
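The abstract describes RubricRAG only at a high level. A minimal sketch of what retrieval-augmented rubric generation could look like is given below; the names embed, generate_rubric, and rubric_bank are hypothetical placeholders assumed for illustration, not the paper's actual implementation.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_rubrics(query_vec, rubric_bank, k=3):
    # rubric_bank: list of {"query_vec": np.ndarray, "rubric": str} entries
    # built from related queries with existing rubrics (assumed structure).
    ranked = sorted(rubric_bank,
                    key=lambda item: cosine(query_vec, item["query_vec"]),
                    reverse=True)
    return [item["rubric"] for item in ranked[:k]]

def rubric_rag(query, embed, generate_rubric, rubric_bank, k=3):
    # Generate a query-specific rubric, conditioning the LLM on rubrics
    # retrieved from related queries as in-context domain knowledge.
    # `embed` and `generate_rubric` are caller-supplied placeholders
    # (e.g., an embedding model and an LLM call).
    retrieved = retrieve_rubrics(embed(query), rubric_bank, k)
    prompt = (
        "Example rubrics for related queries:\n\n"
        + "\n\n".join(retrieved)
        + f"\n\nWrite an evaluation rubric for this query:\n{query}"
    )
    return generate_rubric(prompt)
```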