Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.