In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used to estimate the diversity of grammatical structures produced during a given period by an individual, a community, or a type of speaker. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation, whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate, both theoretically and empirically, that a grammar's derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the most practical index of syntactic complexity; I demonstrate that MLU is not a mere proxy, but a fundamental measure of syntactic diversity. In combination with the new derivational entropy rate measure, it provides a theory-free assessment of grammatical complexity. The derivational entropy rate indexes the rate at which different grammatical annotation frameworks determine the grammatical complexity of treebanks. I introduce the Smoothed Induced Treebank Entropy (SITE) as a tool for estimating these measures accurately, even from very small treebanks. I conclude by discussing important implications of these results for both NLP and human language processing.