While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing: identifying and categorizing any main flaw in an explanation and providing suggestions to address the flaw, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critique model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide, for the first time, high-quality, nuanced, automatic evaluation of those model explanations. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.