Recent advances in large language models enable documents to be represented as dense semantic embeddings, supporting similarity-based operations over large text collections. However, many web-scale systems still rely on flat clustering or predefined taxonomies, which limits insight into hierarchical topic relationships. In this paper, we apply hierarchical density modeling to large language model embeddings, a combination that, to our knowledge, has not previously been explored. Instead of enforcing a fixed taxonomy or a single clustering resolution, the method progressively relaxes local density constraints, revealing how compact semantic groups merge into broader thematic regions. The resulting tree encodes multi-scale semantic organization directly from data, making structural relationships between topics explicit. We evaluate the hierarchies on standard text benchmarks, showing that semantic alignment peaks at intermediate density levels and that abrupt transitions correspond to meaningful changes in semantic resolution. Beyond benchmarks, we apply the approach to large institutional and scientific corpora, exposing dominant fields, cross-disciplinary proximities, and emerging thematic clusters. By framing hierarchical structure as an emergent property of density in embedding spaces, this method provides an interpretable, multi-scale representation of semantic structure suitable for large, evolving text collections.
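The idea of progressively relaxing a local density constraint can be illustrated with a minimal sketch. It is not the paper's implementation, which the abstract does not specify. It assumes one common formulation: single-linkage merging over mutual reachability distances, as used in HDBSCAN-style methods. The function names (`core_distances`, `density_merge_tree`, `clusters_at`), the neighbour count `k`, and the two-dimensional toy points standing in for embeddings are all hypothetical.

```python
import math
from itertools import combinations


def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (a local density proxy)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def density_merge_tree(points, k=2):
    """Return merge events (level, i, j), ordered by increasing density level.

    The level of a pair is its mutual reachability distance:
    max(core(i), core(j), dist(i, j)). Raising the level relaxes the
    density constraint, so compact groups merge before sparse ones.
    """
    core = core_distances(points, k)
    edges = sorted(
        (max(core[i], core[j], math.dist(points[i], points[j])), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    parent = list(range(len(points)))
    merges = []
    for level, i, j in edges:
        root_i, root_j = _find(parent, i), _find(parent, j)
        if root_i != root_j:  # this edge joins two separate groups
            parent[root_j] = root_i
            merges.append((level, i, j))
    return merges


def clusters_at(n, merges, level):
    """Cut the tree: replay every merge whose level does not exceed `level`."""
    parent = list(range(n))
    for merge_level, i, j in merges:
        if merge_level > level:
            break  # merges are sorted, so later ones are looser still
        parent[_find(parent, j)] = _find(parent, i)
    groups = {}
    for x in range(n):
        groups.setdefault(_find(parent, x), []).append(x)
    return sorted(groups.values())


# Toy "embeddings": two nearby compact groups and one distant group.
points = [
    (0, 0), (0, 1), (1, 0),
    (5, 0), (5, 1), (6, 0),
    (20, 20), (20, 21), (21, 20),
]
merges = density_merge_tree(points, k=2)
for level in (0.5, 1.5, 5.0, 30.0):
    print(level, len(clusters_at(len(points), merges, level)))
```

On this toy data the sweep yields 9, 3, 2, and then 1 group as the level increases: the three compact groups form first, the two nearby groups merge next, and the distant group joins last. The sketch is quadratic in the number of points, so a real corpus would need an approximate nearest-neighbour index or an existing library implementation.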