The paper explores how the structure of human natural language can be seen as a product of the evolution of an inter-personal communication code that targets the maximisation of culture-agnostic and cross-lingual metrics such as anti-entropy, compression factor and cross-split F1 score. The exploration is part of a larger unsupervised language learning effort: an attempt is made to perform meta-learning in a space of hyper-parameters, maximising the F1 score against the "ground truth" language structure by means of maximising the metrics mentioned above. The paper presents preliminary results of a cross-lingual word-level segmentation (tokenisation) study for Russian, Chinese and English, as well as a subword segmentation (morphological parsing) study for English. It is found that the language structure in word-level segmentation, or tokenisation, can be seen as driven by all of these metrics, with anti-entropy being more relevant to English and Russian and compression factor more specific to Chinese. The study of subword segmentation, or morphological parsing, on the English lexicon has revealed a direct connection with compression factor while, surprisingly, the same connection with anti-entropy has turned out to be inverse.
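To make the metrics named above more concrete, the following is a minimal Python sketch of how a segmentation could be scored. It is an illustration under stated assumptions, not the paper's implementation: the function names `boundaries`, `boundary_f1` and `anti_entropy` are hypothetical, F1 is computed over token end positions, and anti-entropy is assumed here to be one minus the Shannon entropy of the token distribution normalised by its maximum possible value. The paper's exact definitions may differ.

```python
import math
from collections import Counter


def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def boundary_f1(predicted, reference):
    """F1 score over token end positions (assumed scoring scheme)."""
    p, r = boundaries(predicted), boundaries(reference)
    hits = len(p & r)
    if hits == 0:
        return 0.0
    precision = hits / len(p)
    recall = hits / len(r)
    return 2 * precision * recall / (precision + recall)


def anti_entropy(tokens):
    """Assumed definition: 1 - H / log2(V), where H is the Shannon
    entropy of the token distribution and V the number of distinct tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if len(counts) < 2:
        return 1.0  # zero or one distinct token: no uncertainty
    h = -sum(c / total * math.log2(c / total) for c in counts.values())
    return 1.0 - h / math.log2(len(counts))


# Both segmentations cover the same unspaced text "thecatsat".
reference = ["the", "cat", "sat"]
predicted = ["the", "cats", "at"]
print(boundary_f1(predicted, reference))
print(anti_entropy(["a", "a", "a", "b"]))
```

In a hyper-parameter search of the kind described, a score such as `anti_entropy` could be computed without any reference segmentation, whereas `boundary_f1` needs the "ground truth" tokens, which is why the former is attractive as an unsupervised objective.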