As generative AI systems become integrated into real-world applications, organizations increasingly need to understand and interpret their behavior. In particular, decision-makers need to understand what causes generative AI systems to exhibit specific output characteristics. Within this general topic, this paper examines a key question: what is it about the input -- the prompt -- that causes an LLM-based generative AI system to produce output that exhibits specific characteristics, such as toxicity, negative sentiment, or political bias? To examine this question, we adapt a common technique from the Explainable AI literature: counterfactual explanations. We explain why traditional counterfactual explanations cannot be applied directly to generative AI systems, owing to key differences in how these systems function. We then propose a flexible framework that adapts counterfactual explanations to non-deterministic generative AI systems in scenarios where downstream classifiers can reveal key characteristics of their outputs. Based on this framework, we introduce an algorithm for generating prompt-counterfactual explanations (PCEs). Finally, we demonstrate the production of counterfactual explanations for generative AI systems with three case studies, each examining a different output characteristic (viz., political leaning, toxicity, and sentiment). The case studies further show that PCEs can streamline prompt engineering to suppress undesirable output characteristics and can enhance red-teaming efforts to uncover additional prompts that elicit undesirable outputs. Ultimately, this work lays a foundation for prompt-focused interpretability in generative AI: a capability that will become indispensable as these models are entrusted with higher-stakes tasks and subject to emerging regulatory requirements for transparency and accountability.