This review summarizes the main methodological concepts used to study natural language from the perspective of complexity science and documents their applicability in identifying both universal and system-specific features of language in its written form. Three main complexity-related research trends in quantitative linguistics are covered. The first part addresses word frequencies in texts and demonstrates that taking punctuation into account restores the scaling that is often observed to break down in Zipf's law for the most frequent words. The second part introduces methods inspired by time series analysis, which are used to study various kinds of correlations in written texts. The time series in question are generated by partitioning a text into sentences or into phrases between consecutive punctuation marks. These series exhibit features often found in signals generated by complex systems, such as long-range correlations or (multi)fractal structures. Moreover, the distances between punctuation marks appear to follow the discrete variant of the Weibull distribution. The third part reviews the application of the network formalism to natural language, particularly in the context of so-called word-adjacency networks. Parameters characterizing the topology of such networks can be used to classify texts, for example, from a stylometric perspective. The network approach can also be applied to represent the organization of word associations. The structure of word-association networks turns out to differ significantly from that of random networks, revealing genuine properties of language. Finally, punctuation seems to have a significant impact not only on the information-carrying ability of language but also on its key statistical properties; hence it is recommended to treat punctuation marks on a par with words.
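The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the tokenizer, the set of punctuation marks, the treatment of the start of the text as a boundary, and all function names are assumptions made for illustration. The discrete Weibull form used here, P(k) = (1-p)^((k-1)^beta) - (1-p)^(k^beta) for k = 1, 2, ..., is one common parametrization and is likewise an assumption, since the abstract does not state which one is meant.

```python
import re
from collections import Counter

# Assumption: a small fixed set of marks; the abstract does not list
# which punctuation marks are counted.
PUNCTUATION = {".", ",", ";", ":", "!", "?"}


def tokenize(text):
    """Split text into lowercase words and single punctuation marks."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?|[.,;:!?]", text.lower())


def rank_frequency(tokens, include_punctuation=True):
    """Return (token, count) pairs sorted by descending count.

    The position in the list (starting at 1) is the rank used in a
    Zipf-style rank-frequency plot.
    """
    if not include_punctuation:
        tokens = [t for t in tokens if t not in PUNCTUATION]
    return Counter(tokens).most_common()


def punctuation_gaps(tokens):
    """Count the words between consecutive punctuation marks.

    The start of the text is treated as a boundary (an assumption);
    trailing words with no closing mark are dropped.
    """
    gaps, run = [], 0
    for token in tokens:
        if token in PUNCTUATION:
            if run > 0:
                gaps.append(run)
            run = 0
        else:
            run += 1
    return gaps


def discrete_weibull_pmf(k, p, beta):
    """Discrete Weibull probability of a gap of length k (k >= 1).

    Assumed parametrization; beta = 1 reduces to a geometric distribution.
    """
    return (1 - p) ** ((k - 1) ** beta) - (1 - p) ** (k ** beta)


def adjacency_edges(tokens):
    """Weighted directed edges of a word-adjacency network.

    Two tokens are linked when one directly follows the other; the
    weight is the number of times that happens.
    """
    return Counter(zip(tokens, tokens[1:]))


sample = "The cat sat, and the dog ran. The cat slept."
tokens = tokenize(sample)
print(rank_frequency(tokens)[:3])
print(punctuation_gaps(tokens))  # [3, 4, 3]
print(adjacency_edges(tokens)[("the", "cat")])  # 2
```

Passing `include_punctuation=False` to `rank_frequency` gives the words-only ranking, so the two rank-frequency curves can be compared on a real corpus. Fitting `p` and `beta` to the observed gaps is left out of the sketch.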