Prompt-Counterfactual Explanations for Generative AI System Behavior

As generative AI systems become integrated into real-world applications, organizations increasingly need to understand and interpret their behavior. In particular, decision-makers need to understand what causes generative AI systems to exhibit specific output characteristics. Within this general topic, this paper examines a key question: what is it about the input (the prompt) that causes an LLM-based generative AI system to produce output that exhibits specific characteristics, such as toxicity, negative sentiment, or political bias? To examine this question, we adapt a common technique from the Explainable AI literature: counterfactual explanations. We explain why traditional counterfactual explanations cannot be applied directly to generative AI systems, owing to several differences in how these systems function. We then propose a flexible framework that adapts counterfactual explanations to non-deterministic, generative AI systems in scenarios where downstream classifiers can reveal key characteristics of their outputs. Based on this framework, we introduce an algorithm for generating prompt-counterfactual explanations (PCEs). Finally, we demonstrate the production of counterfactual explanations for generative AI systems with three case studies, each examining a different output characteristic (political leaning, toxicity, and sentiment). The case studies further show that PCEs can streamline prompt engineering to suppress undesirable output characteristics and can enhance red-teaming efforts to uncover additional prompts that elicit undesirable outputs. Ultimately, this work lays a foundation for prompt-focused interpretability in generative AI: a capability that will become indispensable as these models are entrusted with higher-stakes tasks and become subject to emerging regulatory requirements for transparency and accountability.
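The abstract does not spell out the PCE algorithm, so the following is only a minimal sketch of the general idea it describes, not the authors' method. It assumes a prompt can be perturbed by deleting words, that the non-deterministic system is sampled several times per prompt, and that a downstream classifier scores each output. The names `toy_generate`, `toy_classifier`, `expected_score`, and `find_counterfactual`, along with the threshold and sample count, are all hypothetical choices made for illustration.

```python
import itertools
import random
from statistics import mean
from typing import Callable, List, Optional, Tuple


def toy_generate(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic generative system."""
    # Echo a random subset of the prompt words so outputs vary between calls.
    return " ".join(w for w in prompt.split() if rng.random() < 0.8)


def toy_classifier(text: str) -> float:
    """Hypothetical downstream classifier: 1.0 if a flagged word appears."""
    flagged = {"terrible", "awful"}
    return 1.0 if any(w.lower().strip(".,!") in flagged for w in text.split()) else 0.0


def expected_score(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    classify: Callable[[str], float],
    n_samples: int,
    rng: random.Random,
) -> float:
    """Estimate the mean classifier score over repeated samples of the output."""
    return mean(classify(generate(prompt, rng)) for _ in range(n_samples))


def find_counterfactual(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    classify: Callable[[str], float],
    threshold: float = 0.2,   # assumed cut-off for "characteristic suppressed"
    n_samples: int = 50,      # assumed number of samples per candidate prompt
    max_removed: int = 2,     # assumed cap on the size of the explanation
    seed: int = 0,
) -> Optional[Tuple[List[str], str, float]]:
    """Return the smallest set of removed words that drops the score below threshold."""
    rng = random.Random(seed)
    words = prompt.split()
    # Try smaller deletions first so the first hit is a minimal explanation.
    for k in range(1, max_removed + 1):
        for idxs in itertools.combinations(range(len(words)), k):
            candidate = " ".join(w for i, w in enumerate(words) if i not in idxs)
            score = expected_score(candidate, generate, classify, n_samples, rng)
            if score < threshold:
                return [words[i] for i in idxs], candidate, score
    return None


if __name__ == "__main__":
    print(find_counterfactual(
        "Write a terrible review of the hotel", toy_generate, toy_classifier
    ))
```

Scoring a sample mean instead of a single output is what distinguishes this setting from counterfactuals for a deterministic classifier: one generation could flip the label by chance, so the sketch aggregates over repeated samples before comparing against the threshold.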



