High-dimensional data sets are common in genome-enabled prediction. Such data sets often contain nonlinear relationships with complex dependence structures. For such situations, vine copula based (quantile) regression is an important tool. However, current vine copula based regression approaches do not scale to high and ultra-high dimensions. To perform high-dimensional sparse vine copula based regression, we propose two methods. First, we show their superiority in computational complexity over the existing methods. Second, we define relevant, irrelevant, and redundant explanatory variables for quantile regression. Then, via simulation studies, we show our methods' power in selecting relevant variables and their prediction accuracy in high-dimensional sparse data sets. Next, we apply the proposed methods to high-dimensional real data, aiming at the genomic prediction of maize traits. Some data-processing and feature extraction steps for the real data are further discussed. Finally, we show the advantage of our methods over linear models and quantile regression forests in simulation studies and real data applications.