Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distribution shift poses a major obstacle to safe clinical deployment. Out-of-distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals whose semantic meaning is poorly understood, which limits trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model's predicted class and measure the stability of the class logits. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept-conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high-stakes medical applications.
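The inference-time procedure described in the abstract (perturb a deep representation along the predicted class's concept vectors, then measure how much the class logits move) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the additive perturbation `h + eps * v`, the step size `eps`, the score (mean absolute change of the predicted-class logit), the toy tanh head standing in for the layers after the perturbed one, and the random unit-norm "concept vectors" used in place of SAE-derived ones.

```python
import numpy as np

def head(h, W1, W2):
    """Toy stand-in for the layers after the perturbed one (tanh MLP -> logits)."""
    return W2 @ np.tanh(W1 @ h)

def stability_score(h, concepts, W1, W2, eps=0.5):
    """Return (predicted class, mean |change| of its logit under concept perturbations).

    h        : (d,) intermediate representation of one sample
    concepts : dict mapping class index -> (k, d) array of concept vectors
               (hypothetical; in the paper these would come from an SAE)
    eps      : perturbation strength (assumed hyperparameter)
    """
    logits = head(h, W1, W2)
    c = int(np.argmax(logits))
    deltas = []
    for v in concepts[c]:
        perturbed = head(h + eps * v, W1, W2)  # additive perturbation (assumption)
        deltas.append(abs(perturbed[c] - logits[c]))
    return c, float(np.mean(deltas))

# Illustrative setup with random weights and random unit-norm concept vectors.
rng = np.random.default_rng(0)
d, hidden, n_classes, k = 16, 32, 3, 4
W1 = rng.normal(size=(hidden, d))
W2 = rng.normal(size=(n_classes, hidden))
concepts = {}
for cls in range(n_classes):
    V = rng.normal(size=(k, d))
    concepts[cls] = V / np.linalg.norm(V, axis=1, keepdims=True)

h = rng.normal(size=d)
pred, score = stability_score(h, concepts, W1, W2, eps=0.5)
```

Under the stated hypothesis, a larger `score` would indicate a sample whose prediction is less stable under class-conditioned perturbation, so thresholding it gives one possible OOD decision rule. A nonlinear head is used deliberately: with a purely linear head the logit change would not depend on `h`, and the score could not separate samples.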