While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing: identifying and categorizing any main flaw in an explanation and providing suggestions to address the flaw, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critique model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide high-quality, nuanced, automatic evaluation of those model explanations for the first time. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.