Unsupervised black-box models are drivers of scientific discovery, yet they are difficult to interpret, as their output is often a multidimensional embedding rather than a well-defined target. While explainability for supervised learning uncovers how input features contribute to predictions, its unsupervised counterpart should relate input features to the structure of the learned embeddings. However, adaptations of supervised model explainability to unsupervised learning provide either single-sample or dataset-summary explanations, which remain too fine-grained or too reductive to be meaningful, and they cannot explain embeddings that lack an explicit mapping function. To bridge this gap, we propose LAVA, a post-hoc, model-agnostic method that explains local embedding organization through feature covariation in the original input data. LAVA explanations comprise modules: local subpatterns of input feature correlation that recur globally across the embedding. LAVA delivers stable explanations at a desired level of granularity, revealing domain-relevant patterns, such as visual parts of images or disease signals in cellular processes, that existing methods miss.
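The abstract does not specify LAVA's algorithm, but the core idea it states, local feature covariation in embedding neighborhoods grouped into globally recurring modules, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function name `lava_sketch`, the choice of k-nearest neighbors in embedding space, and the use of k-means to group local correlation patterns are placeholders, not the paper's method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

def lava_sketch(X, Z, k=30, n_modules=5):
    """Illustrative sketch (not the paper's algorithm): for each point,
    compute the feature-feature correlation of its k nearest neighbors
    in embedding space, then cluster these local correlation patterns
    into recurring 'modules'.

    X: (n, d) original input features
    Z: (n, m) learned embedding coordinates
    """
    n, d = X.shape
    # Local neighborhoods defined by proximity in the embedding
    nbrs = NearestNeighbors(n_neighbors=k).fit(Z)
    _, idx = nbrs.kneighbors(Z)
    # One flattened d x d feature-correlation matrix per neighborhood
    local_corr = np.empty((n, d * d))
    for i in range(n):
        C = np.corrcoef(X[idx[i]], rowvar=False)  # columns = features
        local_corr[i] = np.nan_to_num(C).ravel()  # guard constant features
    # Recurring correlation subpatterns = cluster centroids ("modules")
    km = KMeans(n_clusters=n_modules, n_init=10).fit(local_corr)
    modules = km.cluster_centers_.reshape(n_modules, d, d)
    return modules, km.labels_  # labels map each embedding point to a module
```

Under these assumptions, each returned module is a d x d correlation pattern shared by many embedding neighborhoods, which matches the abstract's notion of local subpatterns of feature correlation that recur globally.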