Approximations to the Fisher Information Metric of Deep Generative Models for Out-Of-Distribution Detection

Likelihood-based deep generative models, such as score-based diffusion models and variational autoencoders, are state-of-the-art machine learning models for approximating high-dimensional distributions of data such as images, text, or audio. One of the many downstream tasks to which they can be naturally applied is out-of-distribution (OOD) detection. However, seminal work by Nalisnick et al., which we reproduce, showed that deep generative models consistently infer higher log-likelihoods for OOD data than for the data they were trained on, marking an open problem. In this work, we analyse the use of the gradient of the log-likelihood of a data point with respect to the parameters of the deep generative model for OOD detection, based on the simple intuition that OOD data should have larger gradient norms than training data. We formalise measuring the size of the gradient as approximating the Fisher information metric. We show that the Fisher information matrix (FIM) has large absolute diagonal values, motivating the use of chi-square distributed, layer-wise gradient norms as features. We combine these features into a simple, model-agnostic and hyperparameter-free method for OOD detection that estimates the joint density of the layer-wise gradient norms for a given data point. We find that these layer-wise gradient norms are weakly correlated, rendering their combined usage informative, and prove that the layer-wise gradient norms satisfy the principle of (data representation) invariance. Our empirical results indicate that this method outperforms the Typicality test for most deep generative model and image dataset pairings.
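To make the idea of layer-wise gradient norms as OOD features concrete, the following is a minimal sketch on a toy model and is not the implementation used in this work. It assumes a diagonal Gaussian "model" whose two parameter groups, `mu` and `log_sigma`, stand in for layers, and it assumes that the joint density of the features can be approximated by independent Gaussians fitted to the log squared gradient norms. All function and variable names are hypothetical.

```python
import math
import random

# Toy stand-in for a deep generative model: a diagonal Gaussian whose two
# parameter groups ("mu", "log_sigma") play the role of layers (assumption).

def log_likelihood(x, mu, log_sigma):
    """Log-density of x under N(mu, diag(exp(log_sigma))^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi) - ls - 0.5 * ((xi - m) / math.exp(ls)) ** 2
        for xi, m, ls in zip(x, mu, log_sigma)
    )

def log_sq_grad_norms(x, mu, log_sigma):
    """Log squared norm of the log-likelihood gradient, one value per group."""
    sq_mu, sq_ls = 0.0, 0.0
    for xi, m, ls in zip(x, mu, log_sigma):
        z = (xi - m) / math.exp(ls)
        sq_mu += (z / math.exp(ls)) ** 2   # d log p / d mu = (x - mu) / sigma^2
        sq_ls += (z * z - 1.0) ** 2        # d log p / d log_sigma = z^2 - 1
    eps = 1e-12                            # guards against log(0)
    return [math.log(sq_mu + eps), math.log(sq_ls + eps)]

def fit_gaussians(features):
    """Fit one Gaussian (mean, variance) per feature, assuming independence."""
    n = len(features)
    stats = []
    for j in range(len(features[0])):
        col = [f[j] for f in features]
        mean_j = sum(col) / n
        var_j = sum((c - mean_j) ** 2 for c in col) / n
        stats.append((mean_j, var_j))
    return stats

def ood_score(feature, stats):
    """Approximate joint log-density of the features; lower suggests OOD."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - 0.5 * (f - m) ** 2 / v
        for f, (m, v) in zip(feature, stats)
    )

def sample(std, n, dim):
    return [[random.gauss(0.0, std) for _ in range(dim)] for _ in range(n)]

def mean(values):
    return sum(values) / len(values)

random.seed(0)
DIM = 16
mu, log_sigma = [0.0] * DIM, [0.0] * DIM   # pretend the model was fit to N(0, I)

train = sample(1.0, 500, DIM)              # in-distribution data used for fitting
in_dist = sample(1.0, 200, DIM)            # held-out in-distribution data
ood = sample(0.1, 200, DIM)                # low-variance data, treated as OOD

stats = fit_gaussians([log_sq_grad_norms(x, mu, log_sigma) for x in train])

in_ll = mean([log_likelihood(x, mu, log_sigma) for x in in_dist])
ood_ll = mean([log_likelihood(x, mu, log_sigma) for x in ood])
in_score = mean([ood_score(log_sq_grad_norms(x, mu, log_sigma), stats) for x in in_dist])
ood_score_mean = mean([ood_score(log_sq_grad_norms(x, mu, log_sigma), stats) for x in ood])

print("mean log-likelihood  in-dist:", in_ll, " OOD:", ood_ll)
print("mean gradient score  in-dist:", in_score, " OOD:", ood_score_mean)
```

In this toy setting the low-variance samples receive a higher log-likelihood than the in-distribution samples, yet their gradient-norm features fall far outside the density fitted on training data, so the score separates them. For a real deep generative model, the per-layer gradients would presumably come from automatic differentiation rather than the closed-form expressions used here.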
