We introduce the loss kernel, an interpretability method for measuring similarity between data points according to a trained neural network. The kernel is the covariance matrix of per-sample losses computed under a distribution of low-loss-preserving parameter perturbations. We first validate our method on a synthetic multitask problem, showing it separates inputs by task as predicted by theory. We then apply this kernel to Inception-v1 to visualize the structure of ImageNet, and we show that the kernel's structure aligns with the WordNet semantic hierarchy. This establishes the loss kernel as a practical tool for interpretability and data attribution.
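The construction above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the linear model, the data sizes, and the use of a small isotropic Gaussian as a stand-in for the low-loss-preserving perturbation distribution are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model: per-sample loss_i(theta) = (x_i @ theta - y_i)^2.
X = rng.normal(size=(6, 3))      # 6 data points, 3 parameters (hypothetical sizes)
theta_star = rng.normal(size=3)  # stand-in for the "trained" parameters
y = X @ theta_star               # targets chosen so theta_star has zero loss

def per_sample_losses(theta):
    # Vector of losses, one entry per data point.
    return (X @ theta - y) ** 2  # shape (6,)

# Draw parameter perturbations around theta_star. A small isotropic
# Gaussian is used here as a simple proxy for a distribution of
# low-loss-preserving perturbations.
n_draws, sigma = 500, 0.05
losses = np.stack([
    per_sample_losses(theta_star + sigma * rng.normal(size=3))
    for _ in range(n_draws)
])                               # shape (n_draws, 6)

# Loss kernel: covariance matrix of per-sample losses across perturbations.
K = np.cov(losses, rowvar=False)  # shape (6, 6), symmetric and PSD
```

Entry `K[i, j]` is large and positive when perturbations that raise the loss on point `i` also tend to raise it on point `j`, which is the sense in which the kernel measures similarity between data points.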