We investigate structural traces of language contact in the intermediate representations of a monolingual language model. Focusing on Persian (Farsi), a historically contact-rich language, we probe the representations that a Persian-trained model produces when it is exposed to languages with varying degrees and types of contact with Persian. Our methodology quantifies the amount of linguistic information encoded in intermediate representations and assesses how this information is distributed across model components for different morphosyntactic features. The results show that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as Case and Gender are strongly shaped by language-specific structure. This suggests that contact effects in monolingual language models are selective and structurally constrained.