Lexical diversity measures the vocabulary variation in texts. While its utility is evident for analyses in language change and applied linguistics, it is not yet clear how to operationalize this concept in a unique way. Here we investigate entropy and type-token ratio, two widely employed metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a diverse testbed for a quantitative approach to lexical diversity. Strikingly, we find a functional relation between entropy and type-token ratio that holds across the corpora under consideration. Further, in the limit of large vocabularies we find an analytical expression that sheds light on the origin of this relation and its connection with both Zipf's and Heaps' laws. Our results thus contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
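As a minimal sketch of the two metrics under study, the following computes the Shannon word entropy and the type-token ratio of a tokenized text. The function name and the toy input are illustrative; the abstract does not specify an implementation.

```python
import math
from collections import Counter

def lexical_diversity(tokens):
    """Return (Shannon entropy in bits, type-token ratio) for a token list."""
    counts = Counter(tokens)
    n = len(tokens)
    # Shannon entropy of the empirical word-frequency distribution
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Type-token ratio: number of distinct word types over total tokens
    ttr = len(counts) / n
    return entropy, ttr

tokens = "the cat sat on the mat the cat".split()
h, ttr = lexical_diversity(tokens)
```

Both quantities depend on text length, which is why a stable functional relation between them across corpora of very different sizes is a nontrivial finding.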