While numerous methods have been proposed for computing distances between probability distributions in Euclidean space, relatively little attention has been given to computing such distances for distributions on graphs. However, there has been a marked increase in data that either lies on a graph (such as protein interaction networks) or can be modeled as a graph (such as single-cell data), particularly in the biomedical sciences. Thus, it becomes important to find ways to compare signals defined on such graphs. Here, we propose Graph Fourier MMD (GFMMD), a novel distance between distributions and signals on graphs. GFMMD is defined via an optimal witness function that is both smooth on the graph and maximizes the difference in expectation between the pair of distributions on the graph. We find an analytical solution to this optimization problem, as well as an embedding of distributions that results from this method. We also prove several properties of this method, including scale invariance and applicability to disconnected graphs. We demonstrate GFMMD on graph benchmark datasets as well as in single-cell RNA-sequencing data analysis. In the latter, we use the GFMMD-based gene embeddings to find meaningful gene clusters. We also propose a novel type of score for gene selection, called the "gene localization score," which helps select genes for cellular state space characterization.
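To make the witness-function idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "smooth on the graph" means the constraint f^T L f <= 1 for the combinatorial Laplacian L = D - A, and that the graph is connected. Under those assumptions the maximizer of E_P[f] - E_Q[f] is proportional to L^+ (P - Q), and the distance is sqrt((P - Q)^T L^+ (P - Q)), where L^+ is the pseudoinverse. The function names `graph_laplacian` and `gfmmd_sketch` are hypothetical.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Combinatorial Laplacian L = D - A of an undirected weighted graph."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def gfmmd_sketch(adjacency, p, q):
    """Distance between two distributions p and q on the nodes of a graph.

    Assumption (not taken from the abstract): the witness f is constrained
    by f^T L f <= 1, and the graph is connected, so that p - q is
    orthogonal to the null space of L.
    """
    laplacian = graph_laplacian(adjacency)
    diff = p - q
    # Unnormalized witness: the solution of L f = p - q with zero mean.
    witness = np.linalg.pinv(laplacian) @ diff
    # Clip tiny negative values caused by floating-point round-off.
    distance = np.sqrt(max(diff @ witness, 0.0))
    if distance > 0:
        # Rescale so the witness satisfies f^T L f = 1.
        witness = witness / distance
    return distance, witness


# Path graph 0 - 1 - 2 with unit edge weights.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
p = np.array([1.0, 0.0, 0.0])  # all mass on node 0
q = np.array([0.0, 0.0, 1.0])  # all mass on node 2

distance, witness = gfmmd_sketch(A, p, q)
print(distance)
```

For two point masses, the squared distance in this sketch equals the effective resistance between the two nodes, which is 2 for the endpoints of this path graph. Handling disconnected graphs, which the abstract says the method supports, would need additional treatment that this sketch does not attempt.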