Distance-based clustering and classification are widely used across many fields to group data with mixed numeric and categorical attributes. Many algorithms rely on a predefined distance measure to cluster data points according to their dissimilarity. Although numerous distance measures exist for purely numerical data, along with several metrics for ordered and unordered categorical data, an efficient and accurate distance for mixed-type data that exploits the continuous and discrete properties simultaneously remains an open problem. Many existing metrics convert numerical attributes to categorical ones or vice versa, so that the data points are treated as a single attribute type; others compute a distance for each attribute separately and sum the results. We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity, with optimal bandwidths selected by cross-validation. We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric, and that it improves clustering accuracy when used in existing distance-based clustering algorithms on simulated and real-world datasets containing continuous-only, categorical-only, and mixed-type data.
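To make the idea of a kernel-based dissimilarity for mixed-type data concrete, the following is a minimal sketch and not the KDSUM definition itself. It assumes a Gaussian kernel for continuous attributes and an Aitchison-Aitken kernel for unordered categorical attributes, and it sums, per attribute, the gap between the kernel evaluated at a point with itself and the kernel evaluated between the two points. The function names, the dictionary-based parameters, and the hand-picked bandwidths are all hypothetical; in particular, no cross-validated bandwidth selection is performed here.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel evaluated at the scaled difference u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def aitchison_aitken_kernel(a, b, lam, n_levels):
    """Aitchison-Aitken kernel for an unordered categorical attribute.

    lam is the bandwidth, assumed to lie in [0, (n_levels - 1) / n_levels].
    """
    return 1.0 - lam if a == b else lam / (n_levels - 1)


def kernel_dissimilarity(x, y, h, lam, n_levels):
    """Sum of per-attribute kernel gaps k(x, x) - k(x, y).

    h:        {column index: bandwidth} for continuous columns (assumed given)
    lam:      {column index: bandwidth} for categorical columns (assumed given)
    n_levels: {column index: number of categories} for categorical columns
    """
    d = 0.0
    for j, bw in h.items():
        # Continuous attribute: the gap grows with the scaled distance.
        d += gaussian_kernel(0.0) - gaussian_kernel((x[j] - y[j]) / bw)
    for j, bw in lam.items():
        # Categorical attribute: the gap is zero on a match.
        same = aitchison_aitken_kernel(x[j], x[j], bw, n_levels[j])
        pair = aitchison_aitken_kernel(x[j], y[j], bw, n_levels[j])
        d += same - pair
    return d


# Two illustrative observations: two continuous columns, one categorical.
x = (1.0, 2.5, "red")
y = (1.5, 2.5, "blue")
h = {0: 1.0, 1: 1.0}   # hand-picked, not cross-validated
lam = {2: 0.1}         # hand-picked, not cross-validated
n_levels = {2: 3}

d_xy = kernel_dissimilarity(x, y, h, lam, n_levels)
d_yx = kernel_dissimilarity(y, x, h, lam, n_levels)
d_xx = kernel_dissimilarity(x, x, h, lam, n_levels)
```

In this sketch, setting a categorical bandwidth to its upper bound, (c - 1)/c for c categories, makes the kernel uniform across categories, so that attribute contributes nothing to the dissimilarity. A bandwidth of zero reduces the categorical term to a simple 0/1 mismatch indicator.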