Artificial intelligence in healthcare requires models that are both accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and their reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
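To make the approach concrete, the sketch below shows a minimal sparse autoencoder trained on precomputed image embeddings, followed by an entropy-style check of how class-specific a single neuron's activations are. All names, dimensions, and hyperparameters here (the embedding tensor `Z`, the 8x over-complete dictionary, the L1 coefficient, the stand-in labels) are illustrative assumptions, not the paper's exact MedSAE configuration or the MedCLIP/MedGEMMA APIs.

```python
# Minimal sketch of a sparse autoencoder over precomputed MedCLIP-style
# embeddings, plus an entropy-based monosemanticity check for one neuron.
# Everything below is a hypothetical setup, not the paper's implementation.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        h = torch.relu(self.encoder(x))       # sparse, non-negative codes
        return self.decoder(h), h

# Assumed setup: 512-d embeddings, 8x over-complete dictionary.
Z = torch.randn(1024, 512)                    # stand-in for real embeddings
sae = SparseAutoencoder(512, 4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                               # sparsity strength (assumed)

for step in range(200):
    recon, codes = sae(Z)
    loss = ((recon - Z) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Entropy-style monosemanticity check: a neuron whose activation mass
# concentrates on one (hypothetical) label class has low entropy.
labels = torch.randint(0, 5, (1024,))         # stand-in CheXpert-style labels
with torch.no_grad():
    _, codes = sae(Z)
acts = codes[:, 0].clamp(min=0)               # activations of one neuron
mass = torch.zeros(5).scatter_add_(0, labels, acts)
p = mass / mass.sum().clamp(min=1e-8)
entropy = -(p * (p + 1e-8).log()).sum()       # lower = more class-specific
```

In this reading, one entropy score per neuron (and per label) gives a simple, comparable measure of how concentrated a neuron's responses are, which is one plausible way to operationalize the monosemanticity comparison the abstract describes.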