Automated variable selection is widely used in statistical model development. Algorithms such as forward, backward, and stepwise selection are available in statistical software packages such as R and SAS. Many researchers have criticized these algorithms because the resulting models are not grounded in theory and tend to be unstable. Furthermore, simulation studies have shown that they often select incorrect variables because of random variation, which makes these model-building strategies unreliable. This article proposes a comprehensive stepwise selection algorithm tailored to logistic regression. Instead of relying on a single measure, such as a $p$-value or Akaike's information criterion, it uses multiple criteria for variable selection to ensure the robustness and soundness of the final outcome. The result of the selection process need not be unique: the algorithm may select several models that can be considered statistically equivalent. A simulation study demonstrates that the proposed variable selection method outperforms the available alternatives.
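For orientation, the kind of single-criterion procedure that the abstract contrasts against can be sketched as follows. This is a minimal illustration of conventional forward selection driven by Akaike's information criterion alone, not the multi-criteria algorithm proposed in the article, whose details are not given here. All function names (`solve`, `fit_aic`, `forward_select`, `simulate`) and the simulated data are assumptions made for this example, and the Newton-Raphson fit is a bare-bones stand-in for the routines provided by R or SAS.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def probability(row, beta):
    """Logistic success probability; the linear predictor is clamped for numerical safety."""
    eta = max(-30.0, min(30.0, sum(b * v for b, v in zip(beta, row))))
    return 1.0 / (1.0 + math.exp(-eta))


def fit_aic(rows, y, iterations=25):
    """Fit a logistic regression by Newton-Raphson and return its AIC (2k - 2 log-likelihood)."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iterations):
        grad = [0.0] * p
        # A tiny ridge on the diagonal keeps the Hessian invertible (illustrative safeguard).
        hess = [[1e-8 if j == k else 0.0 for k in range(p)] for j in range(p)]
        for row, yi in zip(rows, y):
            mu = probability(row, beta)
            for j in range(p):
                grad[j] += (yi - mu) * row[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * row[j] * row[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for row, yi in zip(rows, y):
        mu = probability(row, beta)
        loglik += yi * math.log(mu) + (1 - yi) * math.log(1.0 - mu)
    return 2 * p - 2 * loglik


def forward_select(columns, y):
    """Greedy forward selection: add the variable that lowers AIC most; stop when none does."""
    n = len(y)

    def design(names):
        # Each row starts with 1.0 for the intercept.
        return [[1.0] + [columns[name][i] for name in names] for i in range(n)]

    selected = []
    best_aic = fit_aic(design(selected), y)
    while len(selected) < len(columns):
        trials = [(fit_aic(design(selected + [name]), y), name)
                  for name in columns if name not in selected]
        aic, name = min(trials)
        if aic >= best_aic:
            break
        selected.append(name)
        best_aic = aic
    return selected, best_aic


def simulate(n=300, seed=0):
    """Hypothetical data: only x1 affects the outcome; x2 and x3 are pure noise."""
    rng = random.Random(seed)
    columns = {name: [rng.gauss(0.0, 1.0) for _ in range(n)] for name in ("x1", "x2", "x3")}
    y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * v)) else 0 for v in columns["x1"]]
    return columns, y


if __name__ == "__main__":
    columns, y = simulate()
    print(forward_select(columns, y))
```

Because the stopping rule depends on a single number, a noise variable that happens to lower the AIC slightly in one sample is admitted to the model. Rerunning `simulate` with different seeds is a simple way to observe the instability the abstract describes.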