Interpreting natural language is an increasingly important task for computer algorithms, owing to the growing availability of unstructured textual data. Natural Language Processing (NLP) applications rely on semantic networks for structured knowledge representation. The fundamental properties of semantic networks must be taken into account when designing NLP algorithms, yet a structural investigation of these properties is still lacking. We study the properties of semantic networks from ConceptNet, defined by 7 semantic relations in 11 different languages. We find that semantic networks share universal basic properties: they are sparse and highly clustered, and many exhibit power-law degree distributions. Our findings show that the majority of the considered networks are scale-free. Some networks exhibit language-specific properties determined by grammatical rules. For example, networks from highly inflected languages, such as Latin, German, French and Spanish, show peaks in the degree distribution that deviate from a power law. We find that, depending on the semantic relation type and the language, link formation in semantic networks is guided by different principles: in some networks the connections are similarity-based, while in others they are more complementarity-based. Finally, we demonstrate how knowledge of similarity and complementarity in semantic networks can improve NLP algorithms for missing link inference.
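To make the distinction between similarity-based and complementarity-based link formation concrete, the following is a minimal sketch, not the method used in this work. It assumes one common network-science operationalization: a similarity-driven missing link is suggested by shared neighbors (paths of length 2 between two unconnected nodes), whereas a complementarity-driven missing link is suggested by paths of length 3. The toy edge list and the function names `build_adjacency`, `paths_of_length_2` and `paths_of_length_3` are hypothetical and chosen only for illustration; they are not taken from ConceptNet.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def paths_of_length_2(adj, u, v):
    """Similarity-style score (assumption): number of common neighbors of u and v."""
    return len(adj[u] & adj[v])


def paths_of_length_3(adj, u, v):
    """Complementarity-style score (assumption): number of simple paths u-a-b-v."""
    count = 0
    for a in adj[u]:
        if a == v:
            continue
        for b in adj[a]:
            # b must be a new node on the path and directly linked to v
            if b != u and b != v and v in adj[b]:
                count += 1
    return count


# Hypothetical toy network, not ConceptNet data.
edges = [
    ("cat", "pet"), ("dog", "pet"),
    ("cat", "animal"), ("dog", "animal"),
    ("knife", "cut"), ("scissors", "cut"),
    ("knife", "slice"),
]
adj = build_adjacency(edges)

# Candidate link favored by similarity: the two nodes share neighbors.
similarity_pair = (
    paths_of_length_2(adj, "cat", "dog"),
    paths_of_length_3(adj, "cat", "dog"),
)

# Candidate link favored by complementarity: no shared neighbors,
# but the nodes are joined through a path of length 3.
complementarity_pair = (
    paths_of_length_2(adj, "scissors", "slice"),
    paths_of_length_3(adj, "scissors", "slice"),
)
```

In this toy network the pair ("cat", "dog") is supported only by the length-2 score, while ("scissors", "slice") is supported only by the length-3 score. Which score is the better predictor for a given real network would, under this sketch's assumptions, depend on the semantic relation type and the language.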