Feature selection is one of the most important steps in any methodology for building a statistical learning model. Existing algorithms generally establish some criterion to select the most influential variables and discard those that contribute no relevant information to the model. This approach makes sense in a classical static setting, where the joint distribution of the data does not vary over time. With real data, however, it is common to encounter dataset shift and, specifically, changes in the relationships between variables (concept shift). In that case, the influence of a variable cannot be the only indicator of its quality as a regressor, since the relationship learned in the training phase may no longer correspond to the current situation. We therefore propose a new feature selection methodology for regression problems that takes this into account, using Shapley values to study the effect that each variable has on the predictions. Five examples are analysed: four correspond to typical situations, in which the method matches the state of the art, and one concerns electricity price forecasting in the Iberian market, where a concept shift has occurred. In this last case, the proposed algorithm improves the results significantly.
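To make the distinction between a variable's influence and its reliability under concept shift concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It computes exact Shapley values by enumerating coalitions for a small linear regression, where absent features are replaced by their training mean (an assumption of this sketch), and then compares each feature's mean absolute attribution with the correlation between its attribution and the prediction error after a simulated shift. The synthetic data, the names `shapley_values` and `linear_value_fn`, and the error-correlation diagnostic are all hypothetical choices made for illustration.

```python
import itertools
import math

import numpy as np


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (viable only for few features)."""
    phi = np.zeros(n_features)
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        for size in range(len(others) + 1):
            # Classical Shapley weight |S|! (p - |S| - 1)! / p!
            weight = (
                math.factorial(size)
                * math.factorial(n_features - size - 1)
                / math.factorial(n_features)
            )
            for subset in itertools.combinations(others, size):
                phi[j] += weight * (value_fn(subset + (j,)) - value_fn(subset))
    return phi


def linear_value_fn(weights, x, baseline):
    """Coalition value: linear model output with absent features set to the baseline."""
    def value(subset):
        z = baseline.copy()
        for k in subset:
            z[k] = x[k]
        return float(weights @ z)
    return value


rng = np.random.default_rng(0)
n = 500

# Training period: y depends on features 0 and 1; feature 2 is pure noise.
X_train = rng.normal(size=(n, 3))
y_train = 2.0 * X_train[:, 0] + 1.0 * X_train[:, 1] + rng.normal(scale=0.1, size=n)
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
baseline = X_train.mean(axis=0)

# Later period with a simulated concept shift: the effect of feature 0 flips sign.
X_new = rng.normal(size=(n, 3))
y_new = -2.0 * X_new[:, 0] + 1.0 * X_new[:, 1] + rng.normal(scale=0.1, size=n)

phi_new = np.array(
    [shapley_values(linear_value_fn(weights, x, baseline), 3) for x in X_new]
)

# Influence: mean absolute attribution per feature.
influence = np.abs(phi_new).mean(axis=0)

# Illustrative diagnostic: correlation of each feature's attribution with the error.
errors = y_new - X_new @ weights
error_corr = np.array(
    [np.corrcoef(phi_new[:, j], errors)[0, 1] for j in range(3)]
)

print("influence:", influence)
print("attribution/error correlation:", error_corr)
```

In this toy setting the shifted feature remains the most influential one according to the trained model, while its attributions are also the ones most strongly tied to the prediction errors, which is the kind of situation in which influence alone would be a misleading selection criterion.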