In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The method is a variant of the Deterministic Information Bottleneck algorithm, which compresses the data optimally while retaining relevant information about the underlying structure. We compare its performance with that of three well-established clustering methods (KAMILA, K-Prototypes, and Partitioning Around Medoids with Gower's dissimilarity) on simulated and real-world datasets. The results show that, under specific conditions, the proposed approach is a competitive alternative to these conventional clustering techniques.