The Random Language Model (De Giuli 2019) is an ensemble of stochastic context-free grammars that quantifies the syntax of human and computer languages. The model suggests a simple picture of first-language learning as a type of annealing in the vast space of potential languages. In its simplest formulation, it implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken. Here this picture is scrutinized by testing its robustness against extensions of the original model and along trajectories through parameter space different from those originally considered. It is shown that (i) the scenario is robust to explicit symmetry breaking, an inevitable component of learning in the real world; and (ii) the transition to grammatical syntax can be encountered by fixing the deep (hidden) structure while varying the surface (observable) properties. It is also argued that the transition becomes a sharp thermodynamic transition in an idealized limit. Moreover, comparison with human data on the clustering coefficient of syntax networks suggests that the observed transition is equivalent to that normally experienced by children at age 24 months. The results are discussed in light of the theory of first-language acquisition in linguistics and recent successes in machine learning.