Distance-based clustering and classification are widely used across many fields to group data with mixed numeric and categorical attributes. These methods rely on a predefined distance measure to cluster data points by their dissimilarity. Although numerous distance measures exist for purely numerical data, and several exist for ordered and unordered categorical data, finding an optimal distance for mixed-type data remains an open problem. Many existing metrics either convert numerical attributes to categorical ones (or vice versa) so that all attributes can be treated as a single type, or compute a distance for each attribute separately and sum the results. We propose a metric that uses mixed kernels to measure dissimilarity, with kernel bandwidths selected by cross-validation. When used with existing distance-based clustering algorithms, our approach improves clustering accuracy on simulated and real-world datasets containing purely continuous, purely categorical, and mixed-type data.
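To make the idea of a mixed-kernel dissimilarity concrete, the following is a minimal sketch and not the formulation used in this work. It assumes a product of Gaussian kernels for continuous attributes and Aitchison-Aitken kernels for unordered categorical attributes, and it assumes the common kernel-induced dissimilarity d(x, y) = k(x, x) + k(y, y) - 2 k(x, y). The function names, the choice of kernels, and the hand-picked bandwidths are all illustrative assumptions; in particular, the bandwidths here are fixed by hand rather than selected by cross-validation.

```python
import math


def gaussian_kernel(u, h):
    """Gaussian kernel on a continuous difference u with bandwidth h > 0."""
    return math.exp(-0.5 * (u / h) ** 2)


def aitchison_aitken_kernel(a, b, lam, n_levels):
    """Aitchison-Aitken kernel for an unordered categorical attribute.

    lam lies in [0, (n_levels - 1) / n_levels]; lam = 0 reduces to an
    exact-match indicator.
    """
    return 1.0 - lam if a == b else lam / (n_levels - 1)


def mixed_kernel(p, q, cont_idx, cat_idx, h, lam, n_levels):
    """Product kernel over continuous and categorical attributes (assumed form)."""
    value = 1.0
    for j, h_j in zip(cont_idx, h):
        value *= gaussian_kernel(p[j] - q[j], h_j)
    for j, lam_j, c_j in zip(cat_idx, lam, n_levels):
        value *= aitchison_aitken_kernel(p[j], q[j], lam_j, c_j)
    return value


def mixed_kernel_distance(x, y, cont_idx, cat_idx, h, lam, n_levels):
    """Kernel-induced dissimilarity d = k(x, x) + k(y, y) - 2 k(x, y)."""
    def k(p, q):
        return mixed_kernel(p, q, cont_idx, cat_idx, h, lam, n_levels)
    return k(x, x) + k(y, y) - 2.0 * k(x, y)


# Hypothetical records: two continuous attributes and one 3-level category.
x = (1.0, 2.0, "red")
y = (1.0, 2.0, "blue")
z = (4.0, 0.5, "blue")

# Hand-picked bandwidths, for illustration only.
params = dict(cont_idx=[0, 1], cat_idx=[2], h=[1.0, 1.0], lam=[0.1], n_levels=[3])

d_xx = mixed_kernel_distance(x, x, **params)
d_xy = mixed_kernel_distance(x, y, **params)
d_yx = mixed_kernel_distance(y, x, **params)
d_xz = mixed_kernel_distance(x, z, **params)
```

Under these assumptions, a pairwise matrix of such dissimilarities could be passed to any clustering algorithm that accepts precomputed distances. Smaller bandwidths make the measure more sensitive to differences in the corresponding attribute, which is why data-driven bandwidth selection matters.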