In recent years, large amounts of genomic data have been gathered rapidly, creating situations where the number of variables greatly exceeds the number of observations. In these situations, most models designed for moderately high dimensions become computationally infeasible. Hence, there is a need to pre-screen variables so that the dimension is reduced efficiently and accurately to a more moderate scale. Much work has been done to develop such screening procedures for independent outcomes. However, much less work has been done for high-dimensional longitudinal data, in which the observations can no longer be assumed to be independent. In addition, many longitudinal studies aim to capture possible interactions between a genomic variable and time. This calls for new screening procedures for high-dimensional longitudinal data that focus on interactions with time. In this work, we propose a novel conditional screening procedure that ranks variables according to the likelihood value at the maximum likelihood estimates in a semi-marginal linear mixed model, which includes the genomic variable and its interaction with time. To our knowledge, this is the first conditional screening approach for clustered data. We prove that this approach enjoys the sure screening property, and we assess its finite sample performance through simulations, comparing it with an existing screening approach based on generalized estimating equations.
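To make the ranking step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a balanced design, a random-intercept covariance structure, and an intercept and time as the conditioning terms; none of these details are stated in the abstract. The function names `max_loglik` and `screen`, the grid over the variance ratio, and the simulated data are all hypothetical. The variance ratio is profiled over a grid, so the score only approximates the likelihood at the exact maximum likelihood estimates.

```python
import numpy as np


def max_loglik(y, X, lambdas):
    """Approximate maximized Gaussian log-likelihood of a random-intercept model.

    y: (n, m) responses; X: (n, m, q) fixed-effect design; clusters are balanced.
    lam = var(random intercept) / var(error) is profiled over a grid (assumption).
    """
    n, m = y.shape
    N = n * m
    best = -np.inf
    for lam in lambdas:
        c = lam / (1.0 + m * lam)  # (I + lam * J)^{-1} = I - c * J
        WX = X - c * X.sum(axis=1, keepdims=True)
        Wy = y - c * y.sum(axis=1, keepdims=True)
        A = np.einsum("imk,iml->kl", X, WX)  # sum_i X_i' W X_i
        b = np.einsum("imk,im->k", X, Wy)    # sum_i X_i' W y_i
        beta = np.linalg.solve(A, b)         # generalized least squares
        r = y - X @ beta
        rss = np.sum(r * (r - c * r.sum(axis=1, keepdims=True)))
        sigma2 = rss / N                     # ML estimate of the error variance
        ll = -0.5 * (N * np.log(2 * np.pi * sigma2) + n * np.log1p(m * lam) + N)
        best = max(best, ll)
    return best


def screen(y, t, G, lambdas=None):
    """Rank the columns of G by the score of their semi-marginal model.

    Each model contains an intercept, time, one genomic variable, and that
    variable's interaction with time.
    """
    n, m = y.shape
    if lambdas is None:
        lambdas = np.concatenate(([0.0], np.logspace(-3, 2, 30)))
    ones = np.ones((n, m))
    T = np.broadcast_to(t, (n, m))
    scores = np.empty(G.shape[1])
    for k in range(G.shape[1]):
        g = np.broadcast_to(G[:, k, None], (n, m))
        X = np.stack([ones, T, g, g * T], axis=2)
        scores[k] = max_loglik(y, X, lambdas)
    return np.argsort(-scores), scores  # indices sorted by decreasing score


# Simulated example: only variable 0 interacts with time.
rng = np.random.default_rng(0)
n, m, p = 60, 4, 30
t = np.arange(m, dtype=float)
G = rng.standard_normal((n, p))
u = rng.standard_normal((n, 1))  # subject-level random intercepts
y = 1.0 + 0.5 * t + 2.0 * G[:, [0]] * t + u + rng.standard_normal((n, m))
ranking, scores = screen(y, t, G)
print(ranking[:5])
```

In practice one would keep the top-ranked variables up to some threshold and pass them to a joint model; the choice of threshold is not addressed in this sketch.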