Large Language Models (LLMs) are deployed as powerful tools for several natural language processing (NLP) applications. Recent work shows that modern LLMs can generate self-explanations (SEs), which lay out their intermediate reasoning steps to explain their behavior. Self-explanations have seen widespread adoption owing to their conversational and plausible nature. However, there is little to no understanding of their faithfulness. In this work, we discuss the dichotomy between faithfulness and plausibility in SEs generated by LLMs. We argue that while LLMs are adept at generating plausible explanations, ones that seem logical and coherent to human users, these explanations do not necessarily align with the reasoning processes of the LLMs, raising concerns about their faithfulness. We highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness. We assert that the faithfulness of explanations is critical when LLMs are employed for high-stakes decision-making. Moreover, we urge the community to identify the faithfulness requirements of real-world applications and ensure explanations meet those needs. Finally, we propose some directions for future work, emphasizing the need for novel methodologies and frameworks that can enhance the faithfulness of self-explanations without compromising their plausibility, which is essential for the transparent deployment of LLMs in diverse high-stakes domains.