Interpreting natural language is an increasingly important task for computer algorithms because of the growing availability of unstructured textual data. Natural Language Processing (NLP) applications rely on semantic networks for structured knowledge representation. The fundamental properties of semantic networks must be taken into account when designing NLP algorithms, yet they remain to be structurally investigated. We study the properties of semantic networks from ConceptNet, defined by 7 semantic relations in 11 different languages. We find that semantic networks have universal basic properties: they are sparse, highly clustered, and exhibit power-law degree distributions. Our findings show that the majority of the considered networks are scale-free. Some networks exhibit language-specific properties determined by grammatical rules: for example, networks from highly inflected languages, such as Latin, German, French and Spanish, show peaks in the degree distribution that deviate from a power law. We find that, depending on the semantic relation type and the language, link formation in semantic networks is guided by different principles. In some networks the connections are similarity-based, while in others they are more complementarity-based. Finally, we demonstrate how knowledge of similarity and complementarity in semantic networks can improve NLP algorithms for missing link inference.
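The basic properties named in the abstract (sparsity, clustering, degree distribution) can be measured on any undirected network along the lines of the minimal sketch below. The toy edge list and all function names are hypothetical and are not taken from ConceptNet or from the paper. Scoring a candidate link by the number of common neighbours (as a similarity-style score) and by the number of length-3 paths (as a complementarity-style score) is an assumption made here for illustration; the abstract does not state which scores the authors use.

```python
from collections import Counter
from itertools import combinations


def build_graph(edges):
    """Build an undirected simple graph as a dict: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Fraction of possible links that are present; sparse networks score near 0."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nb = adj[node]
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def degree_histogram(adj):
    """Map each degree k to the number of nodes with that degree."""
    return Counter(len(nb) for nb in adj.values())


def common_neighbours(adj, u, v):
    """Similarity-style score (assumed): number of neighbours shared by u and v."""
    return len(adj[u] & adj[v])


def paths_of_length_three(adj, u, v):
    """Complementarity-style score (assumed): number of simple paths u-a-b-v."""
    return sum(
        1
        for a in adj[u]
        for b in adj[v]
        if a != v and b != u and a != b and b in adj[a]
    )


# Hypothetical toy network: a triangle with one pendant node.
edges = [("cat", "dog"), ("cat", "pet"), ("dog", "pet"), ("pet", "animal")]
adj = build_graph(edges)

print("density:", density(adj))
print("average clustering:", average_clustering(adj))
print("degree histogram:", dict(degree_histogram(adj)))
print("common neighbours (cat, animal):", common_neighbours(adj, "cat", "animal"))
print("length-3 paths (cat, animal):", paths_of_length_three(adj, "cat", "animal"))
```

On a real network, a power-law degree distribution would appear as a roughly straight line when the degree histogram is plotted on log-log axes, although a proper statistical fit is needed before calling a network scale-free.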