In variable selection, inherent structural constraints among the candidate variables often make it desirable to impose a selection rule, which prescribes the permissible sets of selected variables; the collection of these permissible sets is called a "selection dictionary". Such selection rules can be complex in real-world data analyses, and failing to incorporate them can both compromise the interpretability of the model and reduce prediction accuracy. However, no general framework has been proposed to formalize selection rules and their applications, which makes it difficult for practitioners to integrate these rules into their analyses. In this work, we establish a framework for structured variable selection that can incorporate general structural constraints. We develop a mathematical language for constructing arbitrary selection rules, in which the selection dictionary is formally defined. We show that every selection rule can be expressed as a combination of operations on constructs, which makes it easier to identify the corresponding selection dictionary. Once the selection dictionary is derived, practitioners can apply their own criteria to select the optimal model from it. Our framework also strengthens existing penalized regression methods for variable selection by providing guidance on how to group variables so that the desired selection rule is respected. Finally, the framework opens the door to new l0-norm-based penalized regression techniques that respect arbitrary selection rules, broadening the options for building robust models suited to the problem at hand.
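To make the relationship between a selection rule and its selection dictionary concrete, here is a minimal sketch that enumerates the dictionary by brute force. It is not the mathematical language developed in this work: the variable names `A`, `B`, `A:B`, the strong-heredity rule, and the helper functions are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical candidate variables: two main effects and their interaction.
candidates = ["A", "B", "A:B"]


def all_subsets(items):
    """Yield every subset of items as a frozenset, smallest first."""
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def strong_heredity(selected):
    """Illustrative rule (an assumption, not taken from the paper):
    the interaction may be selected only if both main effects are."""
    return "A:B" not in selected or {"A", "B"} <= selected


def selection_dictionary(items, rule):
    """Collect every subset of items that the rule permits."""
    return [subset for subset in all_subsets(items) if rule(subset)]


dictionary = selection_dictionary(candidates, strong_heredity)
for subset in dictionary:
    print(sorted(subset))
```

Under this illustrative rule, five of the eight possible subsets are permitted. Brute-force enumeration grows exponentially with the number of candidates, so it is only practical for small examples; a user-defined criterion could then be applied to the permitted subsets to pick a model.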