Direct and indirect evidence of compression of word lengths. Zipf's law of abbreviation revisited - 专知论文


17 March 2023

Direct and indirect evidence of compression of word lengths. Zipf's law of abbreviation revisited

Sonia Petrini,Antoni Casas-i-Muñoz,Jordi Cluet-i-Martinell,Mengxue Wang,Chris Bentz,Ramon Ferrer-i-Cancho

From arXiv. arXiv admin note: substantial text overlap with arXiv:2208.10384

Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal, in the sense that it has the potential to be exceptionless or to have a number of exceptions that is vanishingly small compared to the number of languages on Earth. Since Zipf's pioneering research, this law has been viewed as a manifestation of a universal principle of communication, i.e. the minimization of word lengths, to reduce the effort of communication. Here we revisit the concordance of written language with the law of abbreviation. Crucially, we provide wider evidence that the law also holds in speech (when word length is measured in time), in particular in 46 languages from 14 linguistic families. Agreement with the law of abbreviation provides indirect evidence of compression of languages via the theoretical argument that the law of abbreviation is a prediction of optimal coding. Motivated by the need for direct evidence of compression, we derive a simple formula for a random baseline, which indicates that word lengths are systematically below chance, across linguistic families and writing systems, and independently of the unit of measurement (length in characters or duration in time). Our work paves the way to measure and compare the degree of optimality of word lengths in languages.
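The comparison between observed word lengths and a random baseline can be illustrated with a minimal sketch. The abstract does not state the formula, so the baseline used here is an assumption: the expected mean length when lengths are shuffled uniformly at random across word types, which equals the unweighted mean of type lengths. The function name `length_stats` and the toy sentence are hypothetical, and length is measured in characters only.

```python
from collections import Counter


def length_stats(tokens):
    """Return (observed mean length, random-baseline mean length) for a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Observed: mean length per token, i.e. the sum over types of p_i * l_i,
    # where p_i is the relative frequency and l_i the length in characters.
    observed = sum(c * len(w) for w, c in counts.items()) / total
    # Assumed baseline: expected value of sum p_i * l_i when the lengths are
    # shuffled uniformly at random across types. Each type then gets the
    # average length in expectation, so the result is the unweighted mean.
    baseline = sum(len(w) for w in counts) / len(counts)
    return observed, baseline


# Toy corpus (hypothetical data, for illustration only).
tokens = "the cat saw the dog and the bird saw the elephant".split()
observed, baseline = length_stats(tokens)
print(f"observed mean length: {observed:.3f}")
print(f"random baseline:      {baseline:.3f}")
```

In this toy corpus the observed mean falls below the baseline because the most frequent word is also among the shortest, which is the pattern the abstract describes as word lengths being below chance.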
