Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of great interest. In recent years, many approaches have been proposed, among which unsupervised deep learning (UDL) has received a great deal of attention. Two interesting questions are (a) how to combine the strengths of UDL and other approaches, and (b) how these approaches compare to one another. We combine the Variational Auto-Encoder (VAE), a popular UDL approach, with the recent idea of Influential Feature PCA (IF-PCA), and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on $10$ gene microarray data sets and $8$ single-cell RNA-seq data sets. We find that IF-VAE significantly improves over VAE, but still underperforms IF-PCA. We also find that IF-PCA is quite competitive: it slightly outperforms Seurat and SC3 on the $8$ single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving the phase transition in a Rare/Weak model. By comparison, Seurat and SC3 are more complex and theoretically difficult to analyze; for these reasons, their optimality remains unclear.