Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges

From arXiv; ACM SIGKDD Explorations. 12 pages. Additional resources, including a regularly updated list of related papers and LLM-generated text detectors, are available at https://llm-authorship.github.io

Accurate attribution of authorship is crucial for maintaining the integrity of digital content, improving forensic investigations, and mitigating the risks of misinformation and plagiarism. Addressing the imperative need for proper authorship attribution is essential to uphold the credibility and accountability of authentic authorship. The rapid advancements of Large Language Models (LLMs) have blurred the lines between human and machine authorship, posing significant challenges for traditional methods. We present a comprehensive literature review that examines the latest research on authorship attribution in the era of LLMs. This survey systematically explores the landscape of this field by categorizing four representative problems: (1) Human-written Text Attribution; (2) LLM-generated Text Detection; (3) LLM-generated Text Attribution; and (4) Human-LLM Co-authored Text Attribution. We also discuss the challenges related to ensuring the generalization and explainability of authorship attribution methods. Generalization requires the ability to generalize across various domains, while explainability emphasizes providing transparent and understandable insights into the decisions made by these models. By evaluating the strengths and limitations of existing methods and benchmarks, we identify key open problems and future research directions in this field. This literature review serves as a roadmap for researchers and practitioners interested in understanding the state of the art in this rapidly evolving field.
