The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing, to the best of our knowledge, the first contrastive explanation methods requiring only black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because, if the prompt had been slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a distance function that has meaning to the user, and not necessarily a real-valued representation of a specific response (viz., a class label). We offer two algorithms for finding contrastive explanations: i) a myopic algorithm, which, although effective in creating contrasts, requires many model calls, and ii) a budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts adhering to a query budget, necessary for longer contexts. We show the efficacy of these methods on diverse natural language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.
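The myopic search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_model`, `distance`, and `candidates` are hypothetical stand-ins for the black-box LLM, the user-meaningful distance function, and a source of word-level perturbations. Note that every candidate edit costs one model call, which is what motivates the budgeted variant.

```python
# Hedged sketch of a myopic contrastive search: greedily try word-level
# perturbations of the prompt, query the black-box model after each one,
# and stop at the first modified prompt whose response is far (by the
# user-chosen distance) from the original response.
def myopic_contrastive_search(prompt, query_model, distance, threshold, candidates):
    """Return (modified_prompt, contrastive_response), or None if no
    single-word edit produces a sufficiently different response."""
    original = query_model(prompt)
    words = prompt.split()
    for i in range(len(words)):
        for alt in candidates(words[i]):  # candidate replacements for word i
            trial = words.copy()
            trial[i] = alt
            modified = " ".join(trial)
            response = query_model(modified)  # one model call per candidate
            if distance(original, response) >= threshold:
                return modified, response
    return None
```

With a toy sentiment "model" and a 0/1 distance, the search recovers the intuitive contrast: flipping "good" to "bad" flips the response.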