Explainable AI methods for vision models aim to identify the parts of the input that are important for the final prediction and subsequently relate these regions to human-understandable concepts. Here, we propose instead focusing solely on the encoder and relating its intermediate outputs to the input. We introduce Neuro-Activated Vision Explanations (NAVE), a post-hoc, unsupervised, and architecture-agnostic method (applicable to both CNNs and ViTs) for extracting and visualizing internal representations from frozen vision model encoders. Specifically, NAVE clusters composite feature activations from multiple encoder depths to produce interpretable segmentation maps with controllable granularity, requiring no fine-tuning or architectural modifications. Through extensive experiments, we quantitatively demonstrate that NAVE's concepts align with input semantics and can be used in downstream tasks. We further demonstrate NAVE as an inspection tool by analyzing how training strategies and architectures affect encoder representations. Overall, our results establish NAVE as an effective tool for post-hoc model inspection and for enhancing transparency in vision models. Project repository: https://github.com/Ahcene-B/NAVE
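
To make the clustering step concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes that feature maps from several encoder depths are resized to a common resolution, standardized per depth, concatenated per spatial location, and grouped with plain k-means, where the number of clusters `k` plays the role of the granularity control. The clustering algorithm, the nearest-neighbour resizing, the per-depth standardization, and all function names here are illustrative assumptions; random arrays stand in for real encoder activations.

```python
import numpy as np

def upsample_nearest(fmap, size):
    """Nearest-neighbour resize of a (C, h, w) map to (C, size, size)."""
    _, h, w = fmap.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return fmap[:, rows[:, None], cols[None, :]]

def composite_features(fmaps, size):
    """Stack maps from several depths into one (size*size, sum of C) matrix."""
    parts = []
    for fmap in fmaps:
        up = upsample_nearest(fmap, size)
        # Assumption: standardize each depth so no single layer dominates.
        up = (up - up.mean()) / (up.std() + 1e-8)
        parts.append(up.reshape(up.shape[0], -1).T)
    return np.concatenate(parts, axis=1)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(0)
    return labels

def segment(fmaps, k, size):
    """Cluster composite activations into a (size, size) label map."""
    feats = composite_features(fmaps, size)
    return kmeans(feats, k).reshape(size, size)

# Hypothetical stand-ins for activations taken at three encoder depths.
rng = np.random.default_rng(0)
fmaps = [
    rng.normal(size=(8, 16, 16)),
    rng.normal(size=(16, 8, 8)),
    rng.normal(size=(32, 4, 4)),
]
feats = composite_features(fmaps, size=16)
seg = segment(fmaps, k=4, size=16)
```

In this sketch, raising `k` yields a finer partition of the same composite features and lowering it yields a coarser one. The encoder itself is never modified, which mirrors the frozen, post-hoc setting described above.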