PhaseWin: An Efficient Search Algorithm for Faithful Visual Attribution

Visual attribution is a fundamental tool for interpreting modern vision and vision-language models, particularly when their decisions must be inspected, diagnosed, or audited. Its goal is to explain how a model's decision depends on local regions of the visual input, typically by assigning an importance ordering over candidate image regions. Given an image partitioned into $n$ regions, faithful attribution can be cast as an ordered subset-search problem, in which progressively inserting the selected regions should recover the target model response as early as possible. Exhaustive search over region subsets incurs exponential cost, while the widely used greedy search still requires a quadratic number of model evaluations, because every selection step rescores all remaining candidates. We propose PhaseWin, an efficient subset-search algorithm for faithful visual attribution. PhaseWin reorganizes greedy region selection into a phased window-search procedure: rather than re-evaluating the full candidate set at every step, it alternates between global candidate screening, adaptive pruning, and localized window refinement, while preserving the essential region-ranking behavior of greedy search. We analyze PhaseWin under monotone evidence-accumulation conditions and show that, under feature-level structural assumptions, it attains controllable linear evaluation complexity together with near-greedy faithfulness guarantees. Extensive experiments on image classification, object detection, visual grounding, and image captioning show that, among all compared attribution methods, PhaseWin reaches high faithfulness with the fewest forward passes, empirically realizing the predicted reduction from $O(n^2)$ to $O(n)$. The code is available at https://github.com/Qihuai27/phasewin-va.
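To make the quadratic cost of the greedy baseline concrete, the following minimal sketch counts model evaluations for plain greedy insertion search. It illustrates the baseline only, not PhaseWin itself. The names `greedy_insertion_order` and `toy_score` are hypothetical, and the additive toy scorer merely stands in for a real model forward pass on the image restricted to the chosen regions.

```python
from typing import Callable, FrozenSet, List, Tuple


def greedy_insertion_order(
    n: int, score: Callable[[FrozenSet[int]], float]
) -> Tuple[List[int], int]:
    """Rank n regions greedily and count calls to `score`.

    `score` is a hypothetical stand-in for one model forward pass on the
    image restricted to the given subset of regions.
    """
    selected: List[int] = []
    remaining = set(range(n))
    evaluations = 0
    while remaining:
        best_region, best_score = -1, float("-inf")
        # Every step rescores all remaining candidates: this is the O(n^2) cost.
        for region in sorted(remaining):
            candidate_score = score(frozenset(selected + [region]))
            evaluations += 1
            if candidate_score > best_score:
                best_region, best_score = region, candidate_score
        selected.append(best_region)
        remaining.remove(best_region)
    return selected, evaluations


def toy_score(subset: FrozenSet[int]) -> float:
    """Toy additive scorer (an assumption, not a real vision model)."""
    weights = [0.1, 0.5, 0.2, 0.9]
    return sum(weights[i] for i in subset)


order, evals = greedy_insertion_order(4, toy_score)
print(order, evals)  # [3, 1, 2, 0] 10
```

With $n = 4$ regions the loop performs $4 + 3 + 2 + 1 = 10$ evaluations, that is $n(n+1)/2$, which grows as $O(n^2)$. This is the per-step rescoring cost that a phased window search aims to avoid.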
