AI-associated lexical shifts have been documented mainly in scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive a ranked list of AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT's release. Tracking each language's top 20 AI-overused items, we find that prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in scientific English. We validate our approach extensively, including across random seeds, model variants, data sizes, and model families. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.
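As a rough illustration of ranking lemmas by log prevalence ratio, the sketch below scores each lemma by the log of its prevalence in model continuations over its prevalence in matched human text. The abstract does not define prevalence or any smoothing, so two assumptions are made here: prevalence is taken to be the share of texts containing the lemma at least once, and a small additive constant avoids division by zero. The function names and toy data are hypothetical, and the inputs are assumed to be already lemmatized.

```python
import math
from collections import Counter


def doc_counts(docs):
    """Count, for each lemma, the number of documents it appears in."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))  # count each lemma once per document
    return counts, len(docs)


def log_prevalence_ratios(ai_docs, human_docs, smoothing=0.5):
    """Rank lemmas by log(prevalence in AI text / prevalence in human text).

    Assumption: prevalence = smoothed share of documents containing the lemma.
    Positive scores mark lemmas that are more prevalent in the AI text.
    """
    ai_counts, n_ai = doc_counts(ai_docs)
    human_counts, n_human = doc_counts(human_docs)
    scores = {}
    for lemma in set(ai_counts) | set(human_counts):
        p_ai = (ai_counts[lemma] + smoothing) / (n_ai + 2 * smoothing)
        p_human = (human_counts[lemma] + smoothing) / (n_human + 2 * smoothing)
        scores[lemma] = math.log(p_ai / p_human)
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy, already-lemmatized continuations (illustrative data only).
ai_docs = [["emphasize", "role"], ["emphasize", "team"], ["role", "game"]]
human_docs = [["say", "role"], ["say", "team"], ["role", "game"]]

ranked = log_prevalence_ratios(ai_docs, human_docs)
for lemma, score in ranked:
    print(f"{lemma}\t{score:+.3f}")
```

In this toy input, `emphasize` occurs only in the AI continuations and so ranks first, while `say` occurs only in the human text and ranks last. Taking the first 20 entries of such a ranking per language would give a list comparable in shape to the "top 20 AI-overused items" tracked in the diachronic analysis.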