Gaussian graphical models can be used to extract conditional dependencies between the features of a dataset. This is often done under the assumption that the samples are independent, but this assumption is rarely satisfied in practice. However, state-of-the-art approaches that avoid this assumption are not scalable, with $O(n^3)$ runtime and $O(n^2)$ space complexity. In this paper, we introduce a method that has $O(n^2)$ runtime and $O(n)$ space complexity, without assuming independence. We validate our model on both synthetic and real-world datasets, showing that our method's accuracy is comparable to that of prior work. We demonstrate that our approach can be used on unprecedentedly large datasets, such as a real-world 1,000,000-cell scRNA-seq dataset; this was impossible with previous approaches. Our method retains the flexibility of prior work, such as the ability to handle multi-modal tensor-variate datasets and data with arbitrary marginal distributions. An additional advantage of our method is that, unlike those of prior work, its hyperparameters are easily interpretable.