Nearest-neighbor methods have become popular in statistics and play a key role in statistical learning. Important decisions in nearest-neighbor methods concern which variables to use (when many potential candidates exist) and how to measure the dissimilarity between units. The first decision depends on the scope of the application, while the second depends mainly on the type of variables. Unfortunately, relatively few options can handle mixed-type variables, a situation frequently encountered in practical applications. The most popular dissimilarity for mixed-type variables is derived as one minus Gower's similarity coefficient. It is appealing because it ranges between 0 and 1, being an average of the scaled dissimilarities calculated variable by variable, handles missing values, and allows for a user-defined weighting scheme when averaging the dissimilarities. The discussion of weighting schemes is sometimes misleading, since it often ignores that the unweighted "standard" setting hides an unbalanced contribution of the individual variables to the overall dissimilarity. We address this drawback by following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity. In particular, this note proposes different approaches for measuring the correlation depending on the type of variables. The performance of the proposed approaches is evaluated in simulation studies related to classification and imputation of missing values.
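As an illustration of the weighted Gower's dissimilarity described in the abstract, the following minimal Python sketch averages per-variable scaled dissimilarities between two units, skips variables with a missing value, and accepts optional user-defined weights. The function name `gower_dissimilarity`, the `kinds` and `ranges` arguments, and the treatment of categorical variables as simple mismatch indicators are assumptions made for this example; it does not implement the correlation-balancing weights proposed in the note.

```python
from math import isnan


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and isnan(value))


def gower_dissimilarity(x, y, kinds, ranges, weights=None):
    """Weighted Gower's dissimilarity between two units x and y.

    kinds[k]   -- "num" for a numeric variable, "cat" for a categorical one
    ranges[k]  -- range (max - min) of numeric variable k; ignored for "cat"
    weights[k] -- non-negative weight of variable k (defaults to 1 for all)
    """
    if weights is None:
        weights = [1.0] * len(x)

    weighted_sum = 0.0
    weight_total = 0.0
    for xk, yk, kind, rk, wk in zip(x, y, kinds, ranges, weights):
        # A variable missing in either unit does not contribute at all.
        if is_missing(xk) or is_missing(yk):
            continue
        if kind == "num":
            # Absolute difference scaled by the range, so it lies in [0, 1].
            d = abs(xk - yk) / rk if rk > 0 else 0.0
        else:
            # Categorical: 0 if the categories match, 1 otherwise.
            d = 0.0 if xk == yk else 1.0
        weighted_sum += wk * d
        weight_total += wk

    # Undefined when no variable is observed in both units.
    return weighted_sum / weight_total if weight_total > 0 else float("nan")


x = (1.0, "a", 10.0)
y = (3.0, "b", None)  # third variable is missing for y
kinds = ("num", "cat", "num")
ranges = (4.0, None, 20.0)

unweighted = gower_dissimilarity(x, y, kinds, ranges)
weighted = gower_dissimilarity(x, y, kinds, ranges, weights=(2.0, 1.0, 1.0))
```

In this toy example the third variable is ignored because it is missing for `y`. The unweighted result is the plain mean of the two remaining dissimilarities (0.5 and 1), whereas doubling the weight of the first variable pulls the result toward its smaller dissimilarity.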