The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the proportion of hapaxes, i.e., words that occur exactly once. The derivation rests on two assumptions. The first is the standard urn model, which predicts that the marginal frequency distributions of shorter texts look as if word tokens were sampled blindly from a given longer text. The second is that the hapax rate is a simple function of the text length. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
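The competing hapax-rate models can be sketched as functions of the text length. The parameterizations below are illustrative assumptions only — the abstract does not give the article's exact functional forms, and the Davis model is omitted for the same reason. The sketch assumes the linear and logistic models act on the logarithm of the text length; note that the logistic form, unlike the linear one, is guaranteed to stay within (0, 1), as a proportion must.

```python
import math

# Hedged sketch: candidate forms for the hapax rate h(n), the proportion
# of words occurring exactly once, as a function of text length n.
# These parameterizations are illustrative assumptions, not necessarily
# the article's exact definitions.

def constant_model(n, c):
    # hapax rate independent of text length
    return c

def linear_model(n, a, b):
    # hapax rate linear in the logarithm of text length (assumed form)
    return a + b * math.log(n)

def logistic_model(n, a, b):
    # hapax rate as a logistic function of log text length (assumed form);
    # bounded within (0, 1), unlike the linear model
    return 1.0 / (1.0 + math.exp(-(a + b * math.log(n))))

# Evaluate the logistic model on a range of text lengths; with b < 0
# the predicted hapax rate decreases as texts grow longer.
for n in (10, 100, 1000, 10000):
    print(n, round(logistic_model(n, a=2.0, b=-0.5), 4))
```

A model-comparison study like the one summarized above would fit each of these curves to empirical hapax proportions measured at many text lengths and compare goodness of fit.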