Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic-language learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for the natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference observed in human languages, whereby overt DAM marking more often targets objects than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.