In text analysis, Spherical K-means (SKM) is a variant of k-means clustering widely used to group documents represented as high-dimensional, sparse term-document matrices, often normalized with weighting schemes such as TF-IDF. Researchers frequently want to cluster not only the documents but also the terms associated with them into coherent groups. To address this dual clustering requirement, we introduce Spherical Double K-Means (SDKM), a novel method that clusters documents and terms simultaneously. The approach offers several advantages. First, by integrating the clustering of documents and terms, SDKM exposes the relationships between content and vocabulary, enabling more effective topic identification and keyword extraction. Second, the two-level clustering reveals both the overarching themes and the specific terminology within each document cluster, which improves interpretability. Third, by relying on cosine similarity, SDKM handles the high dimensionality and sparsity inherent in text data, leading to improved computational efficiency. Finally, the method captures changes in thematic content over time, making it well suited to applications in rapidly evolving fields. Together, these properties make SDKM a comprehensive framework for text mining that helps uncover the nuanced patterns and structures critical for robust data analysis. We apply SDKM to the corpus of US presidential inaugural addresses, spanning from George Washington in 1789 to Joe Biden in 2021. The analysis reveals distinct clusters of words and documents that correspond to significant historical themes and periods, demonstrating the efficacy of SDKM in uncovering underlying patterns in textual data.
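As background for the abstract, the following is a minimal sketch of standard spherical k-means, the SKM baseline that SDKM builds on. It is not the SDKM procedure itself, whose details the abstract does not give. The function name `spherical_kmeans`, its parameters, and the toy matrix `X` are illustrative assumptions. Rows stand for documents and columns for terms.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster the rows of X by cosine similarity (standard spherical k-means).

    Illustrative sketch only: this is the single-mode SKM baseline,
    not the double (document + term) method named in the abstract.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Project every row onto the unit sphere; all-zero rows stay zero.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.where(norms == 0, 1.0, norms)
    # Initialise the centroids from k distinct rows chosen at random.
    centroids = U[rng.choice(len(U), size=k, replace=False)].copy()
    labels = np.full(len(U), -1)
    for _ in range(n_iter):
        # For unit vectors the dot product equals the cosine similarity.
        new_labels = (U @ centroids.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = U[labels == j]
            if len(members) == 0:
                continue  # keep the old centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # Re-normalise so the centroid stays on the unit sphere.
                centroids[j] = total / norm
    return labels, centroids


# Hypothetical toy matrix: 4 documents x 4 terms, two obvious topics.
X = np.array([
    [3.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 2.0],
])
labels, centroids = spherical_kmeans(X, k=2)
print(labels)
```

On this toy input the first two rows and the last two rows end up in different clusters. Which cluster receives index 0 depends on the random initialisation. Calling the same function on `X.T` would cluster terms instead of documents, but running the two clusterings separately is not the same as the simultaneous clustering that SDKM is described as performing.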