Automated variable selection is widely used in statistical model development. Algorithms such as forward, backward, or stepwise selection are available in statistical software such as R and SAS. Many researchers have criticized these algorithms because the resulting models are not grounded in theory and tend to be unstable. Furthermore, simulation studies have shown that they often select incorrect variables because of random variation, which makes these model-building strategies unreliable. In this article, a comprehensive stepwise selection algorithm tailored to logistic regression is proposed. Instead of relying on a single measure, such as a $p$-value or Akaike's information criterion, it uses multiple criteria for variable selection, which ensures the robustness and soundness of the final outcome. The result of the selection process need not be unique: the algorithm may return several models that can be considered statistically equivalent. A simulation study demonstrates the superiority of the proposed variable selection method over available alternatives.
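
To make the idea of selecting on more than one criterion concrete, the following is a minimal sketch of forward selection for logistic regression in which a candidate is added only if it passes two checks at once. It is not the algorithm proposed in the article, whose criteria are not stated in this abstract. The two checks used here (a likelihood-ratio $p$-value below `alpha` and a decrease in Akaike's information criterion) are assumptions chosen for illustration, and the names `fit_logit`, `forward_select`, and `demo_data` are hypothetical.

```python
import math
import numpy as np


def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                         # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def forward_select(columns, y, alpha=0.05):
    """Forward selection; a candidate must pass BOTH illustrative criteria."""
    n = len(y)

    def design(names):
        # Intercept column plus the currently chosen predictors.
        return np.column_stack([np.ones(n)] + [columns[c] for c in names])

    selected = []
    ll_cur = fit_logit(design(selected), y)
    while True:
        best = None
        for name in columns:
            if name in selected:
                continue
            ll_new = fit_logit(design(selected + [name]), y)
            lr = max(2.0 * (ll_new - ll_cur), 0.0)    # likelihood-ratio statistic
            p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square tail, 1 degree of freedom
            aic_drop = lr - 2.0                       # AIC(current) - AIC(candidate)
            passes = p_value < alpha and aic_drop > 0.0
            if passes and (best is None or lr > best[1]):
                best = (name, lr, ll_new)
        if best is None:
            return selected
        selected.append(best[0])
        ll_cur = best[2]


def demo_data():
    """Toy data (assumed): x1 is associated with y, x2 is balanced noise."""
    rows = []
    for x1, yv, count in [(0, 0, 16), (0, 1, 4), (1, 0, 4), (1, 1, 16)]:
        for i in range(count):
            rows.append((x1, i % 2, yv))
    arr = np.array(rows, dtype=float)
    return {"x1": arr[:, 0], "x2": arr[:, 1]}, arr[:, 2]


if __name__ == "__main__":
    columns, y = demo_data()
    print(forward_select(columns, y))
```

In this sketch both checks are functions of the same likelihood-ratio statistic, so for single-variable additions the $p$-value threshold is the binding one. A multi-criteria procedure in the spirit of the abstract would presumably combine measures that can genuinely disagree, and could return several candidate models rather than one.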