Studies of large-scale, high-dimensional data in fields such as genomics and neuroscience have brought new insights to science. Yet, despite advances, they often confront several challenges simultaneously: non-linearity, slow computation, inconsistency and uncertain convergence, and small sample sizes relative to high feature dimensions. Here, we propose a relatively simple, scalable, and consistent nonlinear dimension reduction method that can potentially address these issues in unsupervised settings. We call this method Statistical Quantile Learning (SQL) because, methodologically, it leverages a quantile approximation of the latent variables and standard nonparametric techniques (sieve or penalized methods). By doing so, we show that estimation of the model originates from a convex assignment matching problem. Theoretically, we provide the asymptotic properties of SQL and its rates of convergence. Operationally, SQL overcomes both the parametric restrictions of nonlinear factor models in statistics and the difficulties of hyperparameter specification and vanishing gradients in deep learning. Simulation studies support the theory and reveal that SQL outperforms state-of-the-art statistical and machine learning methods. Compared to its linear competitors, SQL explains more variance, yields better separation and explanation, and delivers more accurate outcome prediction when latent factors are used as predictors; compared to its nonlinear competitors, SQL shows considerable advantages in interpretability, ease of use, and computation in high-dimensional settings. Finally, we apply SQL to high-dimensional gene expression data (consisting of 20263 genes from 801 subjects), where the proposed method identifies latent factors predictive of five cancer types. The SQL package is available at https://github.com/jbodelet/SQL.