The predictive power and generalizability of a model depend on the quality of the features it uses. Machine learning (ML) models in banks consider a large number of features, many of which are correlated or otherwise dependent. Including such features can undermine model stability, whereas screening features beforehand can improve the long-term performance of the models. A Markov boundary (MB) is the minimal set of features such that, given this set, no other potential predictor affects the target, while maximal predictive accuracy is retained. Identifying the Markov boundary is straightforward when the features are Gaussian and the relationships between them are linear. This paper outlines common problems in identifying the Markov boundary in structured data when relationships are non-linear and predictors are of mixed data types. We propose a multi-group forward-backward selection strategy that not only handles continuous features but also addresses some of the issues with MB identification in a mixed-data setting, and we demonstrate its capabilities on simulated and real datasets.
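To make the Gaussian, linear case concrete, the following is a minimal sketch of a generic forward-backward (grow-shrink style) Markov boundary search that uses Fisher z-tests on partial correlations. It is not the multi-group strategy proposed in the paper; the function names, the `alpha` threshold, and the simulated design are illustrative assumptions.

```python
import math
import numpy as np


def partial_corr_pvalue(data, i, j, cond):
    """Two-sided Fisher z-test p-value for corr(X_i, X_j | X_cond).

    Assumes the columns of `data` are jointly Gaussian with linear relations.
    """
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)  # precision matrix of the selected columns
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep the log below finite
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def markov_boundary(data, target, alpha=0.01):
    """Forward-backward search for the Markov boundary of column `target`."""
    candidates = [c for c in range(data.shape[1]) if c != target]
    mb = []
    # Forward phase: add the most significant remaining candidate, if any.
    while True:
        pvals = {c: partial_corr_pvalue(data, target, c, mb)
                 for c in candidates if c not in mb}
        if not pvals:
            break
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        mb.append(best)
    # Backward phase: drop features that are independent of the target
    # once the rest of the boundary is conditioned on.
    for c in list(mb):
        rest = [m for m in mb if m != c]
        if partial_corr_pvalue(data, target, c, rest) >= alpha:
            mb.remove(c)
    return sorted(mb)


# Hypothetical simulated design: the target depends on x0 and x1 only;
# x2 is a noisy copy of x0 (correlated but redundant); x3 is pure noise.
rng = np.random.default_rng(0)
n = 2000
x0, x1, x3 = rng.normal(size=(3, n))
x2 = x0 + rng.normal(size=n)
y = x0 + x1 + 0.5 * rng.normal(size=n)
data = np.column_stack([x0, x1, x2, x3, y])

print(markov_boundary(data, target=4, alpha=0.001))
```

In this design the true boundary of the target consists of columns 0 and 1. Column 2 is marginally correlated with the target but becomes conditionally independent once column 0 is in the boundary, which is the kind of redundancy that prior feature screening is meant to remove. The test relies on Gaussianity and linearity, so it would not be appropriate as written for the non-linear, mixed-type setting the paper addresses.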