Selecting the active variables that significantly affect the event of interest is an important problem in the statistical analysis of ultrahigh-dimensional data. Sure independence screening has been shown to be an effective way to reduce the dimensionality of data from a large scale to a moderate one. For censored survival data, existing screening methods mainly rely on the Kaplan--Meier estimator to handle censoring, and may therefore perform poorly when the censoring rate is heavy. In this article, we propose a model-free screening procedure based on the Hilbert-Schmidt independence criterion (HSIC). The proposed method avoids the difficulty of specifying an actual model for a large number of covariates. Compared with existing screening procedures, the new approach has several advantages. First, it does not involve the Kaplan--Meier estimator, so its performance is much more robust under heavy censoring. Second, the empirical estimate of HSIC is simple, since it depends only on the trace of a product of Gram matrices. Third, the procedure requires no complicated numerical optimization, so the computation is simple and fast. Finally, because it employs the kernel method, the procedure is substantially more resistant to outliers. Extensive simulation studies demonstrate that the proposed method compares favorably with existing methods. As an illustration, we apply the proposed method to the diffuse large-B-cell lymphoma (DLBCL) data and the ovarian cancer data.
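To make the "trace of a product of Gram matrices" concrete, the following is a minimal sketch of marginal screening with the standard biased empirical HSIC, trace(KHLH)/(n-1)^2, where H is the centering matrix. Several choices here are assumptions for illustration and are not taken from the article: the Gaussian kernel with a median-heuristic bandwidth, the function names `gaussian_gram`, `hsic`, and `hsic_screen`, and the idea of passing the pair (observed time, censoring indicator) as a bivariate response so that no Kaplan--Meier estimate is needed. The article's actual construction may differ.

```python
import numpy as np

def gaussian_gram(x, bandwidth=None):
    """Gaussian-kernel Gram matrix for n samples (rows of x)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped against round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    if bandwidth is None:
        # Median heuristic (an assumption of this sketch).
        med = np.median(d2[np.triu_indices_from(d2, k=1)])
        bandwidth = np.sqrt(med / 2.0) if med > 0 else 1.0
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def hsic_screen(X, response, d):
    """Rank the columns of X by HSIC with the response; keep the top d.

    `response` may be multivariate, e.g. the columns
    (observed time, censoring indicator) -- a hypothetical choice.
    """
    L = gaussian_gram(response)
    scores = np.array(
        [hsic(gaussian_gram(X[:, j]), L) for j in range(X.shape[1])]
    )
    top = np.argsort(scores)[::-1][:d]  # indices of the d largest scores
    return top, scores
```

Calling `hsic_screen(X, np.column_stack([observed_time, delta]), d)` returns the indices of the `d` highest-scoring covariates together with all marginal scores. Only matrix products and a sort are involved, with no numerical optimization.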