In variable selection, inherent structural constraints among the candidate variables often make it desirable to impose a selection rule, which prescribes the permissible sets of selected variables (collectively called a "selection dictionary"). Such selection rules can be complex in real-world data analyses, and failing to incorporate them can compromise the interpretability of the model and reduce its prediction accuracy. However, no general framework has been proposed to formalize selection rules and their applications, which makes it difficult for practitioners to integrate these rules into their analyses. In this work, we establish a framework for structured variable selection that can incorporate general structural constraints. We develop a mathematical language for constructing arbitrary selection rules, in which the selection dictionary is formally defined. We demonstrate that all selection rules can be expressed as combinations of operations on constructs, which facilitates the identification of the corresponding selection dictionary. Once this selection dictionary is derived, practitioners can apply their own criteria to select the optimal model from it. Additionally, our framework enhances existing penalized regression methods for variable selection by providing guidance on how to group variables so that the desired selection rule is achieved. Finally, the framework opens the door to new l0-norm-based penalized regression techniques that respect arbitrary selection rules, expanding the possibilities for more robust model development.
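To make the notion of a selection dictionary concrete, the following is a minimal sketch that enumerates it by brute force for a toy problem. It is not the mathematical language of operations on constructs that the paper develops. The variable names ("A", "B", "A:B"), the two example rules, and the helper functions are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def power_set(variables):
    """Yield every subset of the candidate variables as a frozenset."""
    items = list(variables)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def selection_dictionary(variables, rule):
    """Return all subsets of `variables` that the selection rule permits.

    `rule` is any function mapping a frozenset of variables to True/False.
    Brute-force enumeration is exponential in the number of variables, so
    this is only practical for small illustrative examples.
    """
    return {subset for subset in power_set(variables) if rule(subset)}


# Hypothetical rule 1: the interaction "A:B" may be selected only together
# with both of its main effects (often called strong heredity).
def strong_heredity(subset):
    return "A:B" not in subset or {"A", "B"} <= subset


# Hypothetical rule 2: select at most two variables.
def at_most_two(subset):
    return len(subset) <= 2


def both(*rules):
    """Combine rules so that a subset is permitted only if every rule permits it."""
    return lambda subset: all(rule(subset) for rule in rules)


def select_best(dictionary, criterion):
    """Pick the permitted subset that minimizes a user-defined criterion."""
    return min(dictionary, key=criterion)


VARIABLES = ["A", "B", "A:B"]
dictionary = selection_dictionary(VARIABLES, strong_heredity)
combined = selection_dictionary(VARIABLES, both(strong_heredity, at_most_two))
```

In practice, the `criterion` passed to `select_best` would be something like an information criterion or a cross-validated prediction error computed by fitting a model on each permitted subset; any callable that scores a subset works in this sketch.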