Variable selection via knockoffs for clustered data

We extend the knockoffs method for selecting predictors to clustered data (cross-sectional or repeated measures). With clustered data, variable selection is complicated by the fact that some predictors are measured at the observation level (level 1), whereas others are measured at the cluster level (level 2) and are therefore constant within clusters. The solution we propose is to conduct variable selection separately at the two levels, using a two-step approach: (i) decompose each level 1 predictor into a level 2 component and a level 1 component by replacing it with its cluster mean and the deviation from that cluster mean; (ii) perform variable selection separately at the two levels, where the level 1 data matrix contains the deviations from the cluster means and the level 2 data matrix contains the cluster means of the level 1 predictors together with the level 2 predictors. To evaluate the performance of the proposed approach, we conduct a simulation study comparing sequential knockoffs, derandomized knockoffs, and the Lasso. The study shows satisfactory results for the proposed approach in terms of false discovery rate and power. All methods fail when applied to the complete data matrix, which contains both level 1 and level 2 predictors. In contrast, all methods perform better when applied to the level 1 and level 2 data matrices separately. Moreover, the sequential knockoffs method performs substantially better than the Lasso and derandomized knockoffs. Our proposal for implementing the knockoffs method in a clustered data framework is feasible, flexible, and effective.
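The decomposition in step (i) and the construction of the two data matrices in step (ii) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_levels`, the `_mean` and `_dev` suffixes, and the toy data are all hypothetical, and the sketch uses only the Python standard library. The knockoff or Lasso selection step itself is not shown; each of the two returned data sets would be passed to the chosen selection procedure separately.

```python
from collections import defaultdict
from statistics import fmean


def split_levels(clusters, level1, level2):
    """Split clustered predictors into level 1 and level 2 data sets.

    clusters: list of cluster labels, one per observation.
    level1:   dict name -> list of observation-level values.
    level2:   dict name -> list of cluster-level values
              (assumed constant within each cluster).
    Returns (level1_rows, level2_rows):
      level1_rows: one dict per observation with deviations from cluster means;
      level2_rows: dict cluster -> dict with cluster means and level 2 predictors.
    """
    # Group observation indices by cluster label.
    members = defaultdict(list)
    for i, c in enumerate(clusters):
        members[c].append(i)

    level1_rows = [{} for _ in clusters]
    level2_rows = {c: {} for c in members}

    # Step (i): replace each level 1 predictor with its cluster mean
    # (level 2 component) and the deviation from it (level 1 component).
    for name, values in level1.items():
        for c, idx in members.items():
            mean = fmean(values[i] for i in idx)
            level2_rows[c][name + "_mean"] = mean
            for i in idx:
                level1_rows[i][name + "_dev"] = values[i] - mean

    # Step (ii): the level 2 data set also keeps the original level 2
    # predictors; the first value in a cluster represents the whole cluster.
    for name, values in level2.items():
        for c, idx in members.items():
            level2_rows[c][name] = values[idx[0]]

    return level1_rows, level2_rows


# Toy data (hypothetical): two clusters with two observations each.
clusters = ["a", "a", "b", "b"]
level1 = {"x": [1.0, 3.0, 10.0, 14.0]}   # observation-level predictor
level2 = {"z": [0, 0, 1, 1]}             # cluster-level predictor

level1_rows, level2_rows = split_levels(clusters, level1, level2)
print(level1_rows)
print(level2_rows)
```

By construction, the deviations sum to zero within each cluster, so the level 1 data set carries only within-cluster variation, while the level 2 data set has one row per cluster and carries only between-cluster variation.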
