Structural invariants and semantic fingerprints in the "ego network" of words

From arXiv. This work was partially funded by the H2020 SoBigData++ (Grant No 871042), H2020 HumaneAI-Net (Grant No 952026), and CHIST-ERA SAI (Grant No not yet available) projects. arXiv admin note: text overlap with arXiv:2110.06015

Well-established cognitive models from anthropology have shown that, due to the cognitive constraints that limit our "bandwidth" for social interactions, humans organise their social relations according to a regular structure. In this work, we postulate that similar regularities can be found in other cognitive processes, such as those involving language production. To investigate this claim, we analyse a dataset containing tweets from a heterogeneous group of Twitter users (regular users and professional writers). Leveraging a methodology similar to the one used to uncover the well-established social cognitive constraints, we find regularities at both the structural and the semantic levels. At the structural level, we find that a concentric layered structure (which we call the ego network of words, in analogy to the ego network of social relationships) captures very well how individuals organise the words they use. The size of the layers in this structure grows regularly when moving outwards (each layer is approximately 2-3 times the size of the previous one), and the two penultimate external layers consistently account for approximately 60% and 30% of the used words, irrespective of the total number of layers of the user. For the semantic analysis, each ring of each ego network is described by a semantic profile, which captures the topics associated with the words in the ring. We find that ring #1 has a special role in the model: it is semantically the most dissimilar and the most diverse among the rings. We also show that the topics that are important in the innermost ring are also predominant in each of the other rings, as well as in the entire ego network. In this respect, ring #1 can be seen as the semantic fingerprint of the ego network of words.
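
The structural quantities mentioned above (ring sizes, layer growth, and the share of words per ring) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the abstract: rings are taken to be disjoint groups of words ordered by usage frequency, layers are taken to be cumulative (layer i contains rings 1 to i, as in the social ego network model), and the ring boundaries are supplied as hypothetical frequency thresholds rather than derived by whatever clustering method the authors use. The function names and the toy counts are illustrative only and do not reproduce the paper's data.

```python
from collections import Counter


def ring_sizes(word_counts, thresholds):
    """Count distinct words per ring.

    `thresholds` is a descending list of minimum usage counts, one per ring
    except the outermost (hypothetical values, not taken from the paper).
    Ring 1 (index 0) holds the most frequently used words.
    """
    sizes = [0] * (len(thresholds) + 1)
    for count in word_counts.values():
        # Pick the first ring whose threshold the count meets;
        # words below every threshold fall into the outermost ring.
        ring = next(
            (i for i, t in enumerate(thresholds) if count >= t),
            len(thresholds),
        )
        sizes[ring] += 1
    return sizes


def layer_sizes(rings):
    """Cumulative sizes: layer i contains rings 1..i (assumed convention)."""
    layers, total = [], 0
    for size in rings:
        total += size
        layers.append(total)
    return layers


def scaling_ratios(layers):
    """Ratio between each layer and the previous one (assumes non-empty layers)."""
    return [outer / inner for inner, outer in zip(layers, layers[1:])]


def ring_shares(rings):
    """Fraction of the ego's distinct words that falls in each ring."""
    total = sum(rings)
    return [size / total for size in rings]


# Toy usage counts for one hypothetical user: 18 distinct words.
toy_counts = Counter(
    {f"w{i}": c for i, c in enumerate([80, 60] + [20] * 4 + [3] * 12)}
)

rings = ring_sizes(toy_counts, thresholds=[50, 10])
layers = layer_sizes(rings)
ratios = scaling_ratios(layers)
shares = ring_shares(rings)

print("ring sizes:    ", rings)
print("layer sizes:   ", layers)
print("scaling ratios:", ratios)
print("ring shares:   ", [round(s, 2) for s in shares])
```

The toy counts were chosen so that each layer is three times the previous one, which falls inside the 2-3 range quoted above. With real data, the thresholds would come from a clustering step over word frequencies, and the ratios and shares would be computed per user before being compared across users.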
