Feature screening is an important tool for analyzing ultrahigh-dimensional data, particularly in omics and oncology studies. However, most attention has focused on identifying features that have a linear or monotonic effect on the response variable. Detecting a sparse set of variables that have a nonlinear or non-monotonic relationship with the response remains a challenging task. To fill this gap, this paper proposes a robust, model-free screening approach for right-censored survival data that quantifies the covariate effect on the restricted mean survival time rather than on the routinely used hazard function. The proposed measure, based on the difference between the restricted mean survival time of the covariate-stratified data and that of the overall data, can identify a broad range of associations, including linear, nonlinear, and non-monotone relationships, and even local dependencies such as change points. The approach is highly interpretable and flexible, and it requires no distributional assumptions. The sure screening property is established, and an iterative screening procedure is developed to address multicollinearity among high-dimensional covariates. Simulation studies demonstrate the advantage of the proposed method in selecting important features that have a complex association with the response variable. The potential of applying the proposed method to interval-censored failure time data is also explored in simulations, with promising results. The method is applied to a breast cancer dataset to identify potential prognostic factors, and the analysis reveals potential associations between breast cancer and lymphoma.
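To make the idea of the measure concrete, the following is a minimal sketch of a marginal screening score built from the difference between stratum-level and overall restricted mean survival time. It is an illustration under stated assumptions, not the paper's estimator: the abstract does not give the exact form, so the Kaplan-Meier estimate of the restricted mean survival time, the quantile-based strata, the squared-difference aggregation weighted by stratum size, and the names `km_rmst`, `rmst_screening_score`, `tau`, and `n_slices` are all hypothetical choices made here.

```python
import numpy as np


def km_rmst(time, event, tau):
    """Area under the Kaplan-Meier survival curve on [0, tau]."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    # Distinct observed event times that fall inside the restriction window.
    event_times = np.unique(time[event & (time <= tau)])
    surv, prev_t, area = 1.0, 0.0, 0.0
    for t in event_times:
        area += surv * (t - prev_t)          # rectangle up to this event time
        at_risk = np.sum(time >= t)          # subjects still under observation
        deaths = np.sum((time == t) & event)
        surv *= 1.0 - deaths / at_risk       # Kaplan-Meier product-limit step
        prev_t = t
    area += surv * (tau - prev_t)            # last rectangle up to tau
    return area


def rmst_screening_score(x, time, event, tau, n_slices=4):
    """Assumed score: size-weighted squared gap between stratum and overall RMST."""
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    overall = km_rmst(time, event, tau)
    # Stratify the covariate at its empirical quantiles (an assumption).
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_slices + 1))
    labels = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_slices - 1)
    score = 0.0
    for k in range(n_slices):
        mask = labels == k
        if not mask.any():
            continue
        gap = km_rmst(time[mask], event[mask], tau) - overall
        score += mask.mean() * gap ** 2
    return score


# Simulated example: x_active has a non-monotone (quadratic) effect on the
# hazard, while x_noise is unrelated to survival.
rng = np.random.default_rng(0)
n = 1000
x_active = rng.normal(size=n)
x_noise = rng.normal(size=n)
failure = rng.exponential(scale=1.0 / np.exp(x_active ** 2 - 1.0))
censor = rng.exponential(scale=5.0, size=n)
obs_time = np.minimum(failure, censor)
obs_event = failure <= censor

score_active = rmst_screening_score(x_active, obs_time, obs_event, tau=1.5)
score_noise = rmst_screening_score(x_noise, obs_time, obs_event, tau=1.5)
print(score_active, score_noise)
```

In a screening setting, a score of this kind would be computed for each covariate separately and the covariates with the largest scores retained. Because the strata are compared with the overall data rather than ordered against each other, a quadratic effect like the one simulated here still produces a large score, which is the kind of non-monotone dependence the abstract describes.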