As generative AI systems become integrated into real-world applications, organizations increasingly need to understand and interpret their behavior. In particular, decision-makers need to understand what causes generative AI systems to exhibit specific output characteristics. Within this general topic, this paper examines a key question: what is it about the input (the prompt) that causes an LLM-based generative AI system to produce output that exhibits specific characteristics, such as toxicity, negative sentiment, or political bias? To examine this question, we adapt a common technique from the Explainable AI literature: counterfactual explanations. We explain why traditional counterfactual explanations cannot be applied directly to generative AI systems, owing to several differences in how these systems function. We then propose a flexible framework that adapts counterfactual explanations to non-deterministic, generative AI systems in scenarios where downstream classifiers can reveal key characteristics of their outputs. Based on this framework, we introduce an algorithm for generating prompt-counterfactual explanations (PCEs). Finally, we demonstrate the production of counterfactual explanations for generative AI systems through three case studies, each examining a different output characteristic (viz., political leaning, toxicity, and sentiment). The case studies further show that PCEs can streamline prompt engineering to suppress undesirable output characteristics and can enhance red-teaming efforts to uncover additional prompts that elicit undesirable outputs. Ultimately, this work lays a foundation for prompt-focused interpretability in generative AI: a capability that will become indispensable as these models are entrusted with higher-stakes tasks and subject to emerging regulatory requirements for transparency and accountability.
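To make the setting concrete, the sketch below illustrates the kind of search loop the framework implies: perturb a prompt, sample outputs from the non-deterministic generative model, and use a downstream classifier to check whether the undesirable characteristic has disappeared. The function names (`generate`, `classify`, `perturb_prompt`) and parameters are hypothetical placeholders for illustration, not the paper's actual algorithm.

```python
# Minimal sketch of a prompt-counterfactual search, assuming a generative model
# paired with a downstream classifier that scores one output characteristic
# (e.g., toxicity). All callables here are hypothetical stand-ins.
from typing import Callable, List, Optional


def find_prompt_counterfactual(
    prompt: str,
    generate: Callable[[str], str],              # LLM: prompt -> one sampled output
    classify: Callable[[str], float],            # classifier: output -> characteristic score
    perturb_prompt: Callable[[str], List[str]],  # proposes candidate edits of a prompt
    threshold: float = 0.5,                      # score above which the characteristic is "present"
    n_samples: int = 5,                          # samples per prompt, to handle non-determinism
    max_rounds: int = 10,
) -> Optional[str]:
    """Search for an edited prompt whose outputs no longer exhibit the characteristic."""
    current_candidates = [prompt]
    for _ in range(max_rounds):
        next_candidates: List[str] = []
        for candidate in current_candidates:
            for edited in perturb_prompt(candidate):
                # Average the classifier score over several generations,
                # since the generative model is non-deterministic.
                scores = [classify(generate(edited)) for _ in range(n_samples)]
                if sum(scores) / len(scores) < threshold:
                    return edited  # prompt-counterfactual found
                next_candidates.append(edited)
        current_candidates = next_candidates
    return None  # no counterfactual found within the search budget
```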