Authorship Attribution (AA) and Authorship Obfuscation (AO) are two interlocking research problems of growing interest and importance in privacy research. Given an artifact, in particular a text t in question, an AA solution aims to accurately attribute t to its true author out of many candidate authors, while an AO solution aims to modify t to hide its true authorship. Traditionally, the notion of authorship, and the privacy concerns that accompany it, have applied only to human authors. In recent years, however, explosive advances in Neural Text Generation (NTG) techniques in NLP, which can synthesize human-quality open-ended texts (so-called "neural texts"), mean that one now has to consider authorship by humans, machines, or their combination. Because of the implications and potential threats of neural texts when used maliciously, it has become critical to understand the limitations of traditional AA/AO solutions and to develop novel AA/AO solutions for dealing with neural texts. In this survey, therefore, we comprehensively review the recent literature on the attribution and obfuscation of neural text authorship from a Data Mining perspective, and share our view on the limitations of existing approaches and on promising research directions.