Distance-based clustering and classification are widely used in various fields to group mixed numeric and categorical data. In many algorithms, a predefined distance measure is used to cluster data points based on their dissimilarity. While numerous distance measures exist for data with purely numerical attributes, along with several metrics for ordered and unordered categorical attributes, an efficient and accurate distance for mixed-type data that uses the continuous and discrete properties simultaneously remains an open problem. Many existing metrics either convert numerical attributes to categorical ones (or vice versa) and treat each data point as a single attribute type, or compute a distance for each attribute separately and sum the results. We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity, with cross-validated optimal bandwidth selection. We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric, and that it improves clustering accuracy when used in existing distance-based clustering algorithms on simulated and real-world datasets containing continuous-only, categorical-only, and mixed-type data.
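To make the idea of a mixed-kernel dissimilarity concrete, the following is a minimal sketch and not the KDSUM definition itself, which the abstract does not spell out. It assumes a Gaussian kernel for continuous attributes and an Aitchison-Aitken kernel for unordered categorical attributes, and it sums, per attribute, the kernel value of a point with itself minus the kernel value between the two points. The function names, the dictionary-based parameters, and the hand-picked bandwidths are all hypothetical; in particular, the bandwidths here are fixed by hand rather than selected by cross-validation.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian density, used here for continuous attributes."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def aitchison_aitken_kernel(a, b, lam, n_levels):
    """Aitchison-Aitken kernel for an unordered categorical attribute.

    lam must lie in [0, (n_levels - 1) / n_levels]; lam = 0 reduces the
    kernel to an indicator of equality.
    """
    return 1.0 - lam if a == b else lam / (n_levels - 1)


def kernel_dissimilarity(x, y, cont_bw, cat_lam, cat_levels):
    """Illustrative mixed-kernel dissimilarity (an assumption, not KDSUM).

    cont_bw:    {column index: bandwidth h > 0} for continuous columns
    cat_lam:    {column index: smoothing lam} for categorical columns
    cat_levels: {column index: number of categories} for the same columns
    Each attribute contributes k(self, self) - k(x, y), so identical
    points are at dissimilarity zero.
    """
    total = 0.0
    for j, h in cont_bw.items():
        total += gaussian_kernel(0.0) - gaussian_kernel((x[j] - y[j]) / h)
    for j, lam in cat_lam.items():
        c = cat_levels[j]
        same = aitchison_aitken_kernel(x[j], x[j], lam, c)
        cross = aitchison_aitken_kernel(x[j], y[j], lam, c)
        total += same - cross
    return total


# Hypothetical records: column 0 is continuous, column 1 is categorical.
x = (1.0, "red")
y = (1.5, "blue")

# Hand-picked smoothing parameters (assumed, not cross-validated).
d_xy = kernel_dissimilarity(x, y, {0: 1.0}, {1: 0.1}, {1: 3})
print(d_xy)
```

In this sketch, increasing a continuous bandwidth or moving a categorical smoothing parameter toward its upper bound reduces that attribute's contribution for every pair of points, which gives some intuition for how kernel smoothing can shrink a dissimilarity toward a uniform one.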