For many scientific questions, understanding the underlying mechanism is the goal. Variable selection is a crucial step toward that understanding because it identifies the regression variables most associated with the outcome of interest. A variable selection method has two components: evaluating candidate models with an information criterion and searching the model space. Here, we provide a comprehensive comparison of variable selection methods using three performance measures: correct identification rate (CIR), recall, and false discovery rate (FDR). We consider the BIC and AIC for evaluating models and exhaustive, greedy, LASSO path, and stochastic search approaches for searching the model space; we also consider LASSO with cross-validation. We perform simulation studies for linear and generalized linear models that parametrically explore a wide range of realistic sample sizes, effect sizes, and correlations among regression variables, and we consider model spaces with both small and large numbers of potential regressors. The results show that exhaustive search with BIC and stochastic search with BIC outperform the other methods on small and large model spaces, respectively. These approaches achieve the highest CIR and lowest FDR, which collectively may support long-term efforts toward increasing replicability in research.
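To make the two components concrete, the following minimal sketch (not the paper's implementation) scores every subset of a small pool of candidate predictors with the Gaussian-linear-model BIC, n·log(RSS/n) + k·log(n), and returns the subset with the lowest score. The function name and simulated data are illustrative assumptions; exhaustive enumeration is only feasible for small model spaces, which is why the abstract pairs it with stochastic search for larger ones.

```python
import itertools
import numpy as np

def bic_exhaustive_search(X, y):
    """Exhaustive best-subset selection scored by BIC (illustrative sketch).

    BIC for a Gaussian linear model: n * log(RSS / n) + k * log(n),
    where k counts estimated coefficients (including the intercept).
    """
    n, p = X.shape
    best_bic, best_subset = np.inf, ()
    for size in range(p + 1):
        for subset in itertools.combinations(range(p), size):
            # Design matrix: intercept column plus the candidate predictors.
            Z = np.column_stack([np.ones(n), X[:, subset]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = np.sum((y - Z @ beta) ** 2)
            k = Z.shape[1]
            bic = n * np.log(rss / n) + k * np.log(n)
            if bic < best_bic:
                best_bic, best_subset = bic, subset
    return best_subset, best_bic

# Simulated example: only the first two of five predictors are active.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
subset, bic = bic_exhaustive_search(X, y)
print(subset, round(bic, 1))
```

With strong effects and n = 200, the selected subset should contain the two truly active predictors; BIC's log(n) complexity penalty is what keeps the false discovery rate low relative to AIC in settings like this.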