In this paper I propose a new measure of linguistic productivity that objectively assesses the ability of an affix to coin new complex words and, unlike other popular measures, does not depend directly on token frequency. Specifically, I suggest that linguistic productivity may be viewed as the probability that an affix combines with a random base. This approach has three advantages. First, token frequency does not dominate the productivity measure but naturally influences the sampling of bases. Second, rather than simply counting attested word types with an affix, the method simulates the construction of these types and then checks whether they are attested in the corpus. Third, the corpus-based approach and randomised design ensure that true neologisms and words coined long ago have equal chances of being selected. The proposed algorithm is evaluated on both English and Russian data. The results offer insight into how linguistic productivity relates to the number of types and tokens. Burgeoning linguistic productivity appears to manifest itself in an increasing number of types. However, this process unfolds in two stages: the increase in high-frequency items comes first, and only then does the increase in low-frequency items follow.
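The sampling idea summarised above can be illustrated with a minimal sketch. It is not the paper's actual procedure: the function name `estimate_productivity`, the toy corpus, the decision to treat every corpus word as a candidate base, and the attachment of the suffix by plain concatenation (with no spelling adjustments or part-of-speech filtering) are all assumptions made for illustration.

```python
import random
from collections import Counter


def estimate_productivity(token_counts, suffix, n_samples=10_000, seed=0):
    """Estimate the probability that `suffix` combines with a random base.

    Hypothetical helper: bases are drawn with replacement from the corpus
    vocabulary, weighted by token frequency, and a draw counts as a hit
    when base + suffix is itself attested in the corpus.
    """
    rng = random.Random(seed)
    words = list(token_counts)
    weights = [token_counts[w] for w in words]
    # Token frequency influences only which bases get sampled.
    bases = rng.choices(words, weights=weights, k=n_samples)
    # Simulate the derived form and check whether the corpus attests it.
    hits = sum(1 for base in bases if base + suffix in token_counts)
    return hits / n_samples


# Toy frequency list (invented counts, for illustration only).
corpus = Counter({"kind": 50, "dark": 30, "table": 40,
                  "kindness": 5, "darkness": 3})
print(estimate_productivity(corpus, "ness"))
```

In this sketch token frequency enters only through the sampling weights, while the score itself is the share of simulated derivations that turn out to be attested, which mirrors the first two advantages listed in the abstract.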