Authorship Attribution in Multilingual Machine-Generated Texts

As Large Language Models (LLMs) reach human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection focused on binary classification, the growing number and diversity of LLMs call for a more fine-grained and more challenging task: authorship attribution (AA), i.e., identifying the precise generator (a specific LLM or a human) behind a text. However, AA research remains largely confined to monolingual settings, with English the most studied language, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to a human or to one of multiple LLM generators across diverse languages. Focusing on 18 languages (covering multiple language families and writing scripts) and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods in terms of their cross-lingual transferability, as well as the impact of the generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly when transferring across diverse language families. This underscores the complexity of multilingual AA and the need for more robust approaches that better match real-world scenarios.

