Large Language Models (LLMs) have recently become proficient at complex tasks by drawing on their rich internal knowledge and reasoning ability. This complexity, however, makes it difficult for traditional input-focused explanation algorithms to explain the decision-making processes of LLMs. Recent work has therefore turned to self-explanation, in which the model explains its own prediction in natural language within a single feed-forward inference pass. However, natural language explanations are often criticized for a lack of faithfulness, since they may not accurately reflect the decision-making behavior of the LLM. In this work, we introduce xLLM, a generative explanation framework that improves the faithfulness of natural language explanations for LLMs. Specifically, we propose an evaluator that quantifies the faithfulness of a natural language explanation, and xLLM improves faithfulness through an iterative optimization process that maximizes the evaluator's faithfulness scores. Experiments on three NLU datasets demonstrate that xLLM significantly improves the faithfulness of the generated explanations, bringing them into alignment with the behavior of the LLMs.
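The abstract does not specify how the evaluator or the optimization loop is built, so the following is only a toy sketch of the general idea of scoring an explanation for faithfulness and iteratively keeping candidates that raise the score. Everything in it is an assumption made for illustration: the keyword-based `toy_positive_prob` stands in for an LLM classifier, the explanation is reduced to a set of cited words, `faithfulness` is a deletion-based score (the confidence drop when cited words are removed), and `refine` is a simple greedy search. None of these names or choices come from xLLM.

```python
from typing import FrozenSet, Tuple

# Hypothetical keyword lists for the toy classifier (not from the paper).
POSITIVE = {"great", "wonderful"}
NEGATIVE = {"boring"}


def toy_positive_prob(text: str) -> float:
    """Stand-in for an LLM classifier: P(positive) from keyword counts."""
    words = text.lower().split()
    raw = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return min(1.0, max(0.0, 0.5 + 0.2 * raw))


def faithfulness(text: str, cited: FrozenSet[str]) -> float:
    """Assumed deletion-based score: the drop in model confidence
    after removing the words the explanation cites as important."""
    kept = " ".join(w for w in text.split() if w.lower() not in cited)
    return toy_positive_prob(text) - toy_positive_prob(kept)


def refine(text: str, initial: FrozenSet[str],
           rounds: int = 3) -> Tuple[FrozenSet[str], float]:
    """Greedy iterative refinement: accept a candidate explanation
    only if it strictly increases the faithfulness score."""
    best, best_score = initial, faithfulness(text, initial)
    for _ in range(rounds):
        improved = False
        for word in sorted(set(text.lower().split()) - best):
            candidate = best | {word}
            score = faithfulness(text, candidate)
            if score > best_score:
                best, best_score, improved = candidate, score, True
        if not improved:
            break  # no candidate raised the score; stop early
    return best, best_score


# Usage: start from an explanation that cites an uninformative word.
explanation, score = refine("a great and wonderful film", frozenset({"film"}))
```

In this toy setting the initial explanation scores zero because removing "film" does not change the classifier's output, and the loop adds only words whose removal lowers the model's confidence. A real evaluator and generator would replace both the scoring function and the candidate proposal step.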