Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce. However, translated text differs systematically from native text. This phenomenon, known as translationese, reflects both traces of the source language and characteristic properties of the translation process itself. In this paper, we study how training on machine-translated data affects small English language models, focusing on how translationese from different source languages shapes linguistic acceptability judgments and language modelling across domains. We train models on English text translated from 24 typologically and resource-diverse source languages, enabling a systematic analysis of how source language and corpus properties influence what models learn. Our results show that the source language has a clear impact on model behavior: general perplexity is driven more by the lexical diversity of the translated corpus, while grammatical performance correlates strongly with the source language's typological similarity to English, given enough data.