In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The proposed approach extends the Information Bottleneck principle to heterogeneous data through generalised product kernels, integrating continuous, nominal, and ordinal variables within a unified optimisation framework. We address two key challenges: we develop a systematic bandwidth selection strategy that equalises the contributions of the different variable types, and we propose an adaptive hyperparameter updating scheme that ensures a valid partition into a predetermined number of potentially imbalanced clusters. Through simulations on 28,800 synthetic data sets and ten publicly available benchmarks, we demonstrate that the proposed method, named DIBmix, achieves superior performance compared to four established methods (KAMILA, K-Prototypes, FAMD with K-Means, and PAM with Gower's dissimilarity). Results show that DIBmix particularly excels when clusters exhibit size imbalances, when the data contain low to moderate cluster overlap, and when categorical and continuous variables are equally represented. The method offers a significant advantage over traditional centroid-based algorithms, establishing DIBmix as a competitive and theoretically grounded alternative for mixed-type data clustering.
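To make the generalised-product-kernel idea concrete, the sketch below combines standard per-variable kernels into a single similarity: a Gaussian kernel for continuous variables, the Aitchison–Aitken kernel for nominal variables, and the Wang–van Ryzin kernel for ordinal variables. This is a minimal illustration of the product-kernel construction only, not the authors' DIBmix implementation; the function names, the parameter layout, and the fixed bandwidths `h` and `lam` are assumptions for the example.

```python
import numpy as np


def gaussian_kernel(x, xi, h):
    # Continuous variable: Gaussian kernel with bandwidth h (assumed fixed here).
    u = (x - xi) / h
    return np.exp(-0.5 * u**2) / (h * np.sqrt(2.0 * np.pi))


def aitchison_aitken_kernel(x, xi, lam, num_levels):
    # Nominal variable: weight (1 - lam) on an exact match,
    # lam / (c - 1) spread over the remaining c - 1 categories.
    return np.where(x == xi, 1.0 - lam, lam / (num_levels - 1))


def wang_van_ryzin_kernel(x, xi, lam):
    # Ordinal variable: geometric decay in the distance between levels.
    d = np.abs(x - xi)
    return np.where(d == 0, 1.0 - lam, 0.5 * (1.0 - lam) * lam**d)


def product_kernel(row_a, row_b, var_types, params):
    """Generalised product kernel: the product of per-variable kernels.

    var_types lists "continuous", "nominal", or "ordinal" per column;
    params[j] holds that column's hyperparameters (illustrative layout).
    """
    k = 1.0
    for j, t in enumerate(var_types):
        if t == "continuous":
            k *= gaussian_kernel(row_a[j], row_b[j], params[j]["h"])
        elif t == "nominal":
            k *= aitchison_aitken_kernel(
                row_a[j], row_b[j], params[j]["lam"], params[j]["levels"]
            )
        else:  # ordinal
            k *= wang_van_ryzin_kernel(row_a[j], row_b[j], params[j]["lam"])
    return float(k)


# Toy usage: one continuous, one nominal (3 levels), one ordinal column.
var_types = ["continuous", "nominal", "ordinal"]
params = [{"h": 1.0}, {"lam": 0.3, "levels": 3}, {"lam": 0.2}]
k_same = product_kernel([0.0, 0, 1], [0.0, 0, 1], var_types, params)
k_diff = product_kernel([0.0, 0, 1], [2.0, 1, 0], var_types, params)
```

Identical rows yield the largest kernel value, and similarity decays with disagreement in any coordinate; in practice the bandwidths `h` and `lam` would be chosen per variable (the paper's bandwidth selection strategy) rather than fixed as above.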