Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without an appropriate similarity measure, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic-aware representations that complement the metric space of categorical data for accurate clustering. Specifically, an LLM is used to describe attribute values for representation enhancement, and the LLM-enhanced embeddings are combined with the original data to uncover semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19-27%. Code is available at https://github.com/develop-yang/ARISE
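To make the idea of complementing the categorical metric space concrete, the following is a minimal sketch and not the ARISE implementation. It assumes a toy single-attribute dataset, uses made-up 2-d vectors as stand-ins for embeddings of LLM-written value descriptions, and replaces the attention weighting with a single hypothetical scalar `weight`. All names (`VALUE_EMBEDDINGS`, `combined_representation`, and so on) are illustrative assumptions.

```python
import numpy as np

# Toy data (hypothetical): one categorical attribute, three objects.
ATTRIBUTES = ["severity"]
ROWS = [("mild",), ("moderate",), ("severe",)]

# Made-up vectors standing in for embeddings of LLM-written descriptions
# of each attribute value. Real embeddings would come from a language model.
VALUE_EMBEDDINGS = {
    ("severity", "mild"): [1.0, 0.0],
    ("severity", "moderate"): [0.8, 0.6],
    ("severity", "severe"): [0.0, 1.0],
}


def one_hot_encode(rows, attributes):
    """One-hot encode rows; distinct values of an attribute end up equidistant."""
    vocab = sorted({(a, v) for row in rows for a, v in zip(attributes, row)})
    index = {key: i for i, key in enumerate(vocab)}
    out = np.zeros((len(rows), len(vocab)))
    for r, row in enumerate(rows):
        for a, v in zip(attributes, row):
            out[r, index[(a, v)]] = 1.0
    return out


def semantic_encode(rows, attributes, embeddings):
    """Concatenate the embedding of each attribute value in a row."""
    return np.array([
        np.concatenate([embeddings[(a, v)] for a, v in zip(attributes, row)])
        for row in rows
    ])


def combined_representation(rows, attributes, embeddings, weight=1.0):
    """Stack one-hot features next to semantic features scaled by `weight`.

    `weight` is a hypothetical scalar; it is not the attention mechanism
    named in the abstract.
    """
    return np.hstack([
        one_hot_encode(rows, attributes),
        weight * semantic_encode(rows, attributes, embeddings),
    ])


def pairwise_distances(x):
    """Euclidean distance between every pair of rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


if __name__ == "__main__":
    print(pairwise_distances(one_hot_encode(ROWS, ATTRIBUTES)))
    print(pairwise_distances(
        combined_representation(ROWS, ATTRIBUTES, VALUE_EMBEDDINGS)
    ))
```

Under these assumptions, the one-hot distances between any two distinct values are identical, while the combined representation places "mild" closer to "moderate" than to "severe". The combined matrix could then be passed to any standard clustering routine.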