An index of the effective number of variables (ENV) is introduced for model selection in nested models. This setting arises, for instance, when we have to decide the order of a polynomial function or the number of bases in a nonlinear regression, or choose the number of clusters in a clustering problem or the number of features in a variable selection application (to name a few examples). The index is inspired by the idea of the maximum area under the curve (AUC). The interpretation of the ENV index is identical to that of the effective sample size (ESS) indices for a set of samples. The ENV index addresses drawbacks of the elbow detectors described in the literature and introduces different confidence measures for the proposed solution. These novel measures can also be employed jointly with different information criteria, such as the well-known AIC and BIC, or with any other model selection procedure. Comparisons with classical and recent schemes are provided in different experiments involving real datasets. Related Matlab code is given.