In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The proposed approach builds on the deterministic variant of the Information Bottleneck algorithm, which is designed to compress data optimally while preserving its relevant structural information. We evaluate our method against four well-established clustering techniques for mixed-type data -- KAMILA, K-Prototypes, Factor Analysis for Mixed Data with K-Means, and Partitioning Around Medoids using Gower's dissimilarity -- on both simulated and real-world datasets. The results show that the proposed approach is a competitive alternative to traditional clustering techniques, particularly under specific conditions where data heterogeneity poses significant challenges.
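To make the compression idea concrete, the following is a minimal sketch of a generic deterministic Information Bottleneck update on a small discrete joint distribution. It is not the method proposed in the paper: how the joint distribution p(x, y) is built from mixed-type data is not described in this abstract, so the toy table below is purely illustrative. The function name `dib_cluster`, its parameters, and the initial assignment are all hypothetical choices made for this example. Each observation x is hard-assigned to the cluster t that maximizes log q(t) - beta * KL(p(y|x) || q(y|t)), where beta trades compression against preserved information.

```python
import math


def dib_cluster(p_xy, init_labels, n_clusters, beta, max_iter=100):
    """Hard-cluster the rows of a joint table p(x, y) with DIB-style updates.

    Assumes every row of p_xy has positive total mass (hypothetical helper).
    """
    eps = 1e-12  # guards against log(0)
    n_y = len(p_xy[0])
    p_x = [sum(row) for row in p_xy]
    p_y_given_x = [[v / px for v in row] for row, px in zip(p_xy, p_x)]
    labels = list(init_labels)

    for _ in range(max_iter):
        # Cluster marginals q(t) and unnormalized cluster-level joints q(t, y).
        q_t = [0.0] * n_clusters
        q_ty = [[0.0] * n_y for _ in range(n_clusters)]
        for row, px, t in zip(p_xy, p_x, labels):
            q_t[t] += px
            for j, v in enumerate(row):
                q_ty[t][j] += v
        # Cluster-conditional distributions q(y|t); empty clusters stay all-zero.
        q_y_given_t = [
            [v / qt if qt > 0 else 0.0 for v in row]
            for row, qt in zip(q_ty, q_t)
        ]

        # Reassign each x to the cluster with the highest DIB score.
        new_labels = []
        for cond in p_y_given_x:
            scores = []
            for t in range(n_clusters):
                kl = sum(
                    p * (math.log(p + eps) - math.log(q + eps))
                    for p, q in zip(cond, q_y_given_t[t])
                    if p > 0
                )
                scores.append(math.log(q_t[t] + eps) - beta * kl)
            new_labels.append(max(range(n_clusters), key=scores.__getitem__))

        if new_labels == labels:  # assignments are stable
            break
        labels = new_labels
    return labels


# Toy joint distribution: rows 0-1 favor y=0, rows 2-3 favor y=1.
toy_p_xy = [
    [0.225, 0.025],
    [0.225, 0.025],
    [0.025, 0.225],
    [0.025, 0.225],
]
print(dib_cluster(toy_p_xy, init_labels=[0, 0, 0, 1], n_clusters=2, beta=10.0))
# -> [0, 0, 1, 1]
```

With a large beta the sketch separates the two groups of rows. With beta set to 0 only the compression term remains, and every row collapses into the largest cluster.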