Feature selection is one of the most important steps in any methodology for building a statistical learning model. Existing algorithms usually establish some criterion to select the most influential variables, discarding those that do not contribute relevant information to the model. This approach makes sense in a static setting where the joint distribution of the data does not vary over time. However, when dealing with real data, it is common to encounter dataset shift and, specifically, changes in the relationships between variables (concept shift). In this case, the influence of a variable cannot be the only indicator of its quality as a regressor, since the relationship learned in the training phase may not correspond to the current situation. To tackle this problem, our approach establishes a direct relationship between the Shapley values and prediction errors, operating at a more local level to effectively detect the individual biases introduced by each variable. The proposed methodology is evaluated through various examples, including synthetic scenarios mimicking sudden and incremental shifts, as well as two real-world cases characterized by concept shift. Additionally, we perform three analyses of standard situations to assess the algorithm's robustness in the absence of shift. The results demonstrate that our proposed algorithm significantly outperforms state-of-the-art feature selection methods in concept shift scenarios, while matching the performance of existing methodologies in static situations.