Large text data sets, such as publications, websites, and other text-based media, exhibit two distinct types of features: (1) the text itself, with its information conveyed through semantics, and (2) its relationships to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.