SEMMS with Random Effects: A Mixed-Model Extension for Variable Selection in Clustered and Longitudinal Data

SEMMS (Scalable Empirical-Bayes Model for Marker Selection) is a variable-selection procedure for generalized linear models that places a three-component normal mixture prior on the regression coefficients. In its original form, SEMMS assumes that all observations are independent. Many real-world datasets, however, arise from repeated-measures or clustered designs in which observations on the same subject are correlated. Ignoring this correlation inflates the apparent residual variance and can severely degrade variable-selection performance. We extend SEMMS to accommodate random intercepts, random slopes, or both, via an alternating coordinate-ascent algorithm. After each round of fixed-effect variable selection, the subject-level best linear unbiased predictors (BLUPs) are updated with \texttt{lmer} (Gaussian) or \texttt{glmer} (non-Gaussian); the fixed-effect step then operates on the random-effect-adjusted response. We describe the algorithm, evaluate its performance in three Gaussian simulation studies spanning a range of signal strengths, random-effect magnitudes, and sample-size and predictor-space regimes, and present a semi-synthetic real-data example. We further extend the framework to non-Gaussian families (Poisson, binomial) via an adaptation of the iteratively reweighted least squares (IRLS) working response: at each outer iteration, the fixed-effect step uses the random-effect-adjusted working response computed from the current \texttt{glmer} fitted values rather than the raw response. When the fixed-effect signal is strong relative to the random-effect variance, the original and extended procedures perform comparably. When the random-effect variance dominates, which is the scenario most likely to cause plain SEMMS to fail, the mixed-model extension recovers the exact true predictor set in 93\% of simulated datasets (Gaussian), 61\% (Poisson), and 65\% (binomial), compared with 1\%, 45\%, and 39\%, respectively, for plain SEMMS.
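
The alternation described above can be sketched in equations. The notation here ($\beta$, $b_i$, $\hat{S}$, $w_{ij}$, the iteration superscript $(t)$) and the starting value $\hat{b}_i^{(0)} = 0$ are assumptions made for illustration and are not taken from the SEMMS software; the working response is the standard IRLS form for a link function $g$, and the adjustment actually implemented may differ in detail. For observation $j$ on subject $i$, with fixed-effect covariates $x_{ij}$ and random-effect covariates $z_{ij}$, the Gaussian model is
\[
y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, \qquad b_i \sim N(0, \Sigma_b), \qquad \varepsilon_{ij} \sim N(0, \sigma^2).
\]
At outer iteration $t$, the fixed-effect step runs SEMMS on the random-effect-adjusted response
\[
\tilde{y}_{ij}^{(t)} = y_{ij} - z_{ij}^{\top} \hat{b}_i^{(t-1)},
\]
which yields a selected predictor set $\hat{S}^{(t)}$; the random-effect step then refits the mixed model with fixed effects restricted to $\hat{S}^{(t)}$ and takes the resulting BLUPs as $\hat{b}_i^{(t)}$. For a non-Gaussian family with link $g$, let $\hat{\eta}_{ij} = x_{ij}^{\top}\hat{\beta} + z_{ij}^{\top}\hat{b}_i$ and $\hat{\mu}_{ij} = g^{-1}(\hat{\eta}_{ij})$ be the current \texttt{glmer} fitted values. The IRLS working response and its random-effect-adjusted version are
\[
w_{ij} = \hat{\eta}_{ij} + (y_{ij} - \hat{\mu}_{ij})\, g'(\hat{\mu}_{ij}), \qquad \tilde{w}_{ij} = w_{ij} - z_{ij}^{\top}\hat{b}_i = x_{ij}^{\top}\hat{\beta} + (y_{ij} - \hat{\mu}_{ij})\, g'(\hat{\mu}_{ij}).
\]
For example, the Poisson log link has $g'(\mu) = 1/\mu$, so $\tilde{w}_{ij} = x_{ij}^{\top}\hat{\beta} + (y_{ij} - \hat{\mu}_{ij})/\hat{\mu}_{ij}$, and the binomial logit link has $g'(\mu) = 1/\{\mu(1-\mu)\}$.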
