In text analysis, Spherical K-Means (SKM) is a variant of k-means clustering widely used to group documents represented as high-dimensional, sparse term-document matrices, often normalized with schemes such as TF-IDF. Researchers frequently want to cluster not only the documents but also the terms associated with them into coherent groups. To meet this dual requirement, we introduce Spherical Double K-Means (SDKM), a novel method that clusters documents and terms simultaneously. The approach offers several advantages. First, by integrating the clustering of documents and terms, SDKM gives deeper insight into the relationship between content and vocabulary, enabling more effective topic identification and keyword extraction. Second, the two-level clustering helps reveal both the overarching themes and the specific terminology within document clusters, which improves interpretability. Third, SDKM handles the high dimensionality and sparsity inherent in text data by relying on cosine similarity, which improves computational efficiency. Finally, the method captures changes in thematic content over time, making it well suited to applications in rapidly evolving fields. SDKM thus provides a comprehensive framework for text mining that helps uncover the nuanced patterns and structures critical to robust data analysis. We apply SDKM to the corpus of US presidential inaugural addresses, spanning from George Washington in 1789 to Joe Biden in 2021. The analysis reveals distinct clusters of words and documents that correspond to significant historical themes and periods, demonstrating the method's efficacy in uncovering underlying patterns in textual data.
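To make the cosine-similarity step concrete, the following is a minimal sketch of plain spherical k-means on document rows, the baseline that the abstract builds on. It is not the SDKM procedure, whose details the abstract does not give. The function name `spherical_kmeans`, its parameters, and the toy matrix (standing in for TF-IDF weights) are assumptions made for illustration.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster the rows of X by cosine similarity (plain SKM sketch, not SDKM)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Project every row onto the unit sphere; guard against all-zero rows.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.where(norms == 0, 1.0, norms)
    # Start from k distinct rows chosen at random.
    centroids = U[rng.choice(len(U), size=k, replace=False)].copy()
    labels = np.full(len(U), -1)
    for _ in range(n_iter):
        # For unit vectors the dot product equals the cosine similarity.
        new_labels = np.argmax(U @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = U[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            length = np.linalg.norm(total)
            if length > 0:
                # Re-project the summed members back onto the unit sphere.
                centroids[j] = total / length
    return labels, centroids


# Hypothetical toy document-term matrix: rows are documents, columns are terms.
# Documents 0-1 use only the first two terms, documents 2-3 only the last two.
X = np.array([
    [3, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 4],
])
labels, centroids = spherical_kmeans(X, k=2)
print(labels)
```

On this toy matrix the first two documents should end up sharing one label and the last two the other, since the two pairs use disjoint vocabularies. Running the same routine on the transposed matrix would cluster terms instead, but doing the two steps independently is only a rough stand-in for the simultaneous clustering that SDKM is described as performing.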