A rapidly developing application of LLMs in XAI is converting quantitative explanations, such as SHAP values, into user-friendly narratives that explain the decisions of smaller prediction models. Evaluating these narratives without relying on human preference studies or surveys is becoming increasingly important in this field. In this work, we propose a framework and explore several automated metrics for evaluating LLM-generated narratives that explain tabular classification tasks. We apply our approach to compare several state-of-the-art LLMs across different datasets and prompt types. As a demonstration of their utility, these metrics allow us to identify new challenges related to LLM hallucination in XAI narratives.
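To make the pipeline concrete, the following is a minimal sketch of the SHAP-to-narrative step, assuming scikit-learn and the shap package; call_llm is a hypothetical stand-in for whichever LLM API is used, and the prompt wording is illustrative rather than the paper's actual prompt.

    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier

    # A small tabular classifier standing in for the "smaller prediction model".
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    model = GradientBoostingClassifier(random_state=0).fit(X, y)

    # Quantitative explanation: per-feature SHAP values (log-odds
    # contributions) for a single instance.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X.iloc[:1])[0]

    # Keep the strongest drivers and hand them to an LLM to narrate.
    top = sorted(zip(X.columns, shap_values),
                 key=lambda fc: abs(fc[1]), reverse=True)[:5]
    prompt = (
        "Explain this classifier's decision to a non-expert reader. "
        "Feature contributions (log-odds): "
        + "; ".join(f"{name}: {value:+.3f}" for name, value in top)
    )
    # narrative = call_llm(prompt)  # hypothetical: any chat-completion API

The automated metrics proposed in the paper would then score the returned narrative, for example by checking it against the SHAP values it claims to describe.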