While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing: identifying and categorizing the main flaw, if any, in an explanation and providing suggestions to address it, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critiquing model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide high-quality, nuanced, automatic evaluation of those model explanations for the first time. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.