Large language models (LLMs) are increasingly used to generate text in a variety of use cases, including journalistic news articles. Because LLMs can be used maliciously to generate disinformation at scale, it is important to build effective detectors for such AI-generated text. Given the surge in development of new LLMs, acquiring labeled training data for supervised detectors is a bottleneck. However, plenty of unlabeled text may be available, without information on which generator produced it. In this work we tackle this data problem in detecting AI-generated news text and frame it as an unsupervised domain adaptation task. Here the domains are the different text generators, i.e., LLMs, and we assume access only to labeled source data and unlabeled target data. We develop a Contrastive Domain Adaptation framework, called ConDA, that blends standard domain adaptation techniques with the representation power of contrastive learning to learn domain-invariant representations that are effective for the final unsupervised detection task. Our experiments demonstrate the effectiveness of our framework, with average performance gains of 31.7% over the best-performing baselines and performance within a 0.8% margin of a fully supervised detector. All our code and data are available at https://github.com/AmritaBh/ConDA-gen-text-detection.
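To make the idea of blending domain adaptation with contrastive learning concrete, the following is a minimal sketch and not the ConDA implementation. The abstract does not name the individual loss terms, so everything here is an assumption made for illustration: an NT-Xent contrastive loss over two views of each text, a linear-kernel maximum mean discrepancy (MMD) as the domain alignment term, binary cross-entropy on the labeled source data only, and the hypothetical weights `lam_con` and `lam_mmd`. Embeddings are plain lists of floats so the sketch has no dependencies.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent contrastive loss; view_a[i] and view_b[i] are a positive pair."""
    n = len(view_a)
    z = [normalize(v) for v in view_a + view_b]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of sample i
        sims = [dot(z[i], z[j]) / temperature for j in range(2 * n) if j != i]
        top = max(sims)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims))
        total += log_denom - dot(z[i], z[pos]) / temperature
    return total / (2 * n)

def linear_mmd(source, target):
    """Squared distance between the mean source and mean target embeddings."""
    delta = [s - t for s, t in zip(mean_vector(source), mean_vector(target))]
    return dot(delta, delta)

def bce_with_logits(logits, labels):
    """Binary cross-entropy on raw logits, using a stable softplus."""
    losses = [max(x, 0.0) + math.log1p(math.exp(-abs(x))) - y * x
              for x, y in zip(logits, labels)]
    return sum(losses) / len(losses)

def combined_loss(src, src_aug, tgt, tgt_aug, src_logits, src_labels,
                  lam_con=0.5, lam_mmd=1.0):
    """Supervised loss on source only, plus contrastive and alignment terms."""
    supervised = bce_with_logits(src_logits, src_labels)  # target has no labels
    contrastive = nt_xent(src, src_aug) + nt_xent(tgt, tgt_aug)
    alignment = linear_mmd(src, tgt)
    return supervised + lam_con * contrastive + lam_mmd * alignment

# Toy embeddings standing in for encoder outputs (hypothetical values).
src = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
src_aug = [[0.9, 0.1, 0.2], [0.1, 0.9, 0.1]]
tgt = [[0.8, 0.3, 0.5], [0.2, 0.7, 0.6]]
tgt_aug = [[0.7, 0.3, 0.6], [0.3, 0.7, 0.5]]
loss = combined_loss(src, src_aug, tgt, tgt_aug,
                     src_logits=[2.0, -1.5], src_labels=[1, 0])
print(f"combined loss: {loss:.4f}")
```

In this sketch only the supervised term touches labels, which mirrors the setting described above: the target generator contributes unlabeled text through the contrastive and alignment terms alone.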