Zipf's law of abbreviation, namely the tendency of more frequent words to be shorter, has been viewed as a manifestation of compression, that is, the minimization of the length of forms, a universal principle of natural communication. Although the claim that languages are optimized has become popular, attempts to measure the degree of optimization of languages have been scarce. Here we present two optimality scores that are dually normalized, namely, they are normalized with respect to both the minimum and the random baseline. We analyze the theoretical and statistical advantages and disadvantages of these and other scores. Using the best score, we quantify the degree of optimality of word lengths per language. The data comprise parallel texts in 20 languages of 9 families, written in 8 scripts, as well as spoken data for 46 languages of 12 families, two constructed languages, and one isolate. Our analyses indicate that languages are optimized to 62 or 67 percent on average (depending on the source) when word lengths are measured in characters, and to 65 percent on average when word lengths are measured in time. In general, spoken word durations are more optimized than written word lengths in characters. Our work paves the way for measuring the degree of optimality of the vocalizations or gestures of other species, and for comparing them against written, spoken, or signed human languages.
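
The abstract does not spell out the scores, so the following is only a minimal sketch of what a dually normalized score could look like, under stated assumptions. It assumes the score takes the form (L_rand - L) / (L_rand - L_min), where L is the frequency-weighted mean word length, L_min is the mean obtained by pairing the most frequent words with the shortest lengths, and L_rand is the expected mean under a uniformly random reassignment of lengths to words. The function name `optimality_score` and its `length` parameter are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def optimality_score(tokens, length=len):
    """Sketch of a dually normalized optimality score (assumed form).

    Returns (L_rand - L) / (L_rand - L_min): 1 at the minimum baseline,
    0 at the random baseline, negative if worse than random.
    `length` maps a word type to its length (characters by default).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    types = list(counts)
    probs = [counts[w] / total for w in types]
    lengths = [length(w) for w in types]

    # Observed mean length, weighted by relative frequency.
    observed = sum(p * l for p, l in zip(probs, lengths))

    # Minimum baseline: most frequent types get the shortest lengths.
    minimum = sum(
        p * l for p, l in zip(sorted(probs, reverse=True), sorted(lengths))
    )

    # Random baseline: expectation over uniformly random permutations of
    # lengths, which equals the unweighted mean length of the types.
    random_mean = sum(lengths) / len(lengths)

    if math.isclose(random_mean, minimum):
        # Happens when all lengths, or all frequencies, are identical.
        raise ValueError("score undefined: random and minimum baselines coincide")

    return (random_mean - observed) / (random_mean - minimum)


if __name__ == "__main__":
    # Frequent words are short: at the minimum baseline.
    print(optimality_score("a a a a bb bb ccc".split()))
    # Frequent words are long: worse than random.
    print(optimality_score("aaa aaa aaa aaa bb bb c".split()))
```

Under this assumed form, a score of 1 means word lengths are as short as any reassignment of the same lengths allows, 0 means they are no better than a random assignment, and negative values mean they are worse than random. Passing a different `length` function (for example, a lookup of mean spoken duration per word type) would adapt the sketch to lengths measured in time.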