Large-scale foundation models exhibit behavioral shifts: intervention-induced behavioral changes that appear after scaling, fine-tuning, reinforcement learning, or in-context learning. While investigating these phenomena has recently received attention, explaining why they appear remains largely overlooked. Classic explainable AI (XAI) methods can surface failures at a single checkpoint of a model, but they are structurally ill-suited to explain what changed internally across checkpoints and to determine which explanatory claims about that change are warranted. We take the position that behavioral shifts should be explained comparatively: the core target should be the intervention-induced shift between a reference model and an intervened model, rather than either model in isolation. To this end, we formulate a Comparative XAI ($Δ$-XAI) framework with a set of desiderata that sound explanation methods for shifts should satisfy. To illustrate how $Δ$-XAI methods work, we introduce a set of possible pipelines, relate them to the desiderata, and present a concrete $Δ$-XAI experiment.
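To make the comparative target concrete, the following is a minimal sketch of one possible $Δ$-XAI pipeline, not the paper's method: it differences gradient×input attributions of a reference and an intervened checkpoint on the same probe inputs, so the explanatory object is the attribution shift itself. The two-layer networks and random data are illustrative stand-ins.

```python
# Minimal Δ-XAI sketch (illustrative, assumed setup): attribute the behavioral
# shift to input features by differencing attributions of two checkpoints.
import torch
import torch.nn as nn

torch.manual_seed(0)

def grad_x_input(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Gradient×input attribution of the scalar output w.r.t. each feature."""
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    return (x.grad * x).detach()

# Stand-ins for a reference checkpoint and an intervened (e.g. fine-tuned) one.
reference = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
intervened = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

x = torch.randn(32, 8)  # shared probe inputs: both models see the same data

# Comparative explanation: which features gained or lost influence after the
# intervention? The delta, not either attribution map alone, is the target.
delta_attr = grad_x_input(intervened, x) - grad_x_input(reference, x)
print("mean attribution shift per feature:", delta_attr.mean(dim=0))
```

Any attribution method could play the role of `grad_x_input` here; the design point is only that both checkpoints are probed on identical inputs so the difference isolates the intervention-induced shift.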