As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content has become increasingly difficult. While early efforts in MGT detection focused on binary classification, the growing number and diversity of LLMs call for a more fine-grained and more challenging task, authorship attribution (AA), i.e., identifying the precise generator (a specific LLM or a human) behind a text. However, AA currently remains confined to monolingual settings, with English the most investigated language, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to a human author or to one of multiple LLM generators across diverse languages. Focusing on 18 languages (covering multiple language families and writing scripts) and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods in terms of their cross-lingual transferability, as well as the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches that better match real-world scenarios.