Large Language Models (LLMs) have become indispensable for evaluating writing. However, the text feedback they provide is often unintelligible, generic, and not specific to user criteria. Inspired by structured rubrics in education and by intelligible AI explanations, we propose iRULER, which follows identified design guidelines to \textit{scaffold} the review process with \textit{specific} criteria, provide \textit{justification} for score selection, and offer \textit{actionable} revisions targeting different quality levels. To \textit{qualify} user-defined criteria, we apply iRULER recursively with a rubric-of-rubrics to iteratively \textit{refine} rubrics. In controlled experiments on writing revision and rubric creation, iRULER yielded the greatest improvement in validated LLM-judged review scores and was perceived as the most helpful and aligned, compared with read-only rubrics and text-based LLM feedback. Qualitative findings further illustrate how iRULER satisfies the design guidelines for user-defined feedback. This work contributes interactive rubric tools for intelligible LLM-based review and revision of writing, as well as user-defined rubric creation.