Measuring language complexity from hierarchical reuse of recurring patterns

We introduce the ladderpath index as a measure of language complexity grounded in algorithmic information theory. It counts the minimum number of steps needed to reconstruct a sequence through hierarchical reuse of repeated substructures, capturing an exactly computable but constrained form of algorithmic compressibility that is related to, but distinct from, Kolmogorov complexity. We apply the ladderpath approach to 21 parallel corpora from the Parallel Universal Dependencies dataset. Across these languages, the ladderpath index is approximately invariant and varies much less than corpus length. This invariance is more pronounced when all corpora are mapped to a unified binary representation, providing evidence for the equi-complexity hypothesis from a representation-independent perspective. We also observe trade-offs between character inventory size and corpus length, and between vocabulary-level and corpus-level reconstruction complexity, supporting the trade-off hypothesis that total complexity is conserved and redistributed across linguistic levels. Although the ladderpath approach receives no linguistic input, the reusable substructures it identifies overlap with words and morphological components attested in the natural vocabulary. The hierarchical reuse captured by the ladderpath approach parallels the chunking mechanisms proposed in cognitive science, in which the human cognitive system compresses linguistic input into nested, reusable units under shared memory and processing constraints. This connection between cognitive chunking and the ladderpath approach offers a new interpretation of the equi-complexity and trade-off hypotheses, grounding both in the shared cognitive architecture that underlies language processing across human languages.
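To make the idea of "counting reconstruction steps with hierarchical reuse" concrete, the following is a minimal sketch, not the authors' algorithm. It swaps in a Re-Pair-style greedy pairing heuristic: the most frequent adjacent pair is repeatedly merged into a reusable block, and each merge or final join is counted as one step. The function name `greedy_reuse_steps` and the step-counting convention are assumptions made for illustration; the result is only an upper bound on the minimum number of pairwise joins, whereas the ladderpath index is defined as an exact minimum.

```python
from collections import Counter


def greedy_reuse_steps(text):
    """Greedy (Re-Pair-style) upper bound on the number of pairwise joins
    needed to rebuild `text` when every block built so far can be reused.

    Returns (steps, blocks), where `blocks` lists the reusable substrings
    in the order they were created. Illustrative heuristic only.
    """
    seq = list(text)   # current sequence of symbols (characters or block ids)
    rules = {}         # block id -> (left symbol, right symbol)

    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # no adjacent pair repeats, nothing left to reuse
            break
        block = ("R", len(rules))  # tuple ids cannot collide with characters
        rules[block] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(block)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged

    def expand(symbol):
        # Characters expand to themselves; blocks expand recursively.
        if isinstance(symbol, str):
            return symbol
        left, right = rules[symbol]
        return expand(left) + expand(right)

    # One join per reusable block, plus the joins that chain the final sequence.
    steps = len(rules) + max(len(seq) - 1, 0)
    blocks = [expand(b) for b in rules]
    return steps, blocks


if __name__ == "__main__":
    print(greedy_reuse_steps("abababab"))
    print(greedy_reuse_steps("abcd"))
```

For `"abababab"` the sketch builds `ab`, then `abab`, then joins two copies of `abab`, which takes 3 steps instead of the 7 joins needed without reuse. A string with no repeats, such as `"abcd"`, gains nothing from reuse. The same function can be applied to any symbol sequence, so the gap between steps and raw length gives a rough feel for how reuse lowers reconstruction cost.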

