The prevalence and high capacity of large language models (LLMs) present significant safety and ethical risks when malicious users exploit them for automated content generation. To prevent potentially deceptive use of LLMs, recent work has proposed several algorithms for detecting machine-generated text. In this paper, we systematically test the reliability of existing detectors by designing two types of attack strategies to fool them: 1) replacing words with context-appropriate synonyms; 2) altering the writing style of the generated text. Both strategies are implemented without human involvement, by instructing LLMs to generate synonymous word substitutions or writing directives that modify the style; the LLMs leveraged in the attack may themselves be protected by detectors. Our results show that these attacks effectively compromise the performance of all tested detectors, underscoring the urgent need for more robust machine-generated text detection systems.
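To make the first attack strategy concrete, the following is a minimal sketch of a word-substitution attack, not the implementation used in this paper. It rests on several assumptions that go beyond the abstract: the attacker can query a detector score, substitutions are chosen greedily, and a small lookup table (`TOY_SYNONYMS`) stands in for the LLM that would propose context-aware synonyms. All names here (`toy_propose`, `toy_detector`, `substitution_attack`, `max_edits`) are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM that proposes context-aware synonyms.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "significant": ["notable", "considerable"],
    "utilize": ["use", "employ"],
    "demonstrate": ["show"],
}


def toy_propose(words: List[str], index: int) -> List[str]:
    """Return candidate replacements for words[index].

    A real attack would prompt an LLM with the full sentence; this toy
    version is a plain dictionary lookup and ignores the context.
    """
    return TOY_SYNONYMS.get(words[index].lower(), [])


def toy_detector(text: str) -> float:
    """Toy detector (assumption): fraction of words on a fixed watch list."""
    flagged = {"significant", "utilize", "demonstrate"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)


def substitution_attack(
    text: str,
    propose: Callable[[List[str], int], List[str]],
    detector: Callable[[str], float],
    max_edits: int = 3,
) -> str:
    """Greedily replace words whenever a candidate lowers the detector score."""
    words = text.split()
    edits = 0
    for i in range(len(words)):
        if edits >= max_edits:
            break
        best_word = words[i]
        best_score = detector(" ".join(words))
        for cand in propose(words, i):
            trial = words[:i] + [cand] + words[i + 1:]
            score = detector(" ".join(trial))
            if score < best_score:  # keep only strict improvements
                best_word, best_score = cand, score
        if best_word != words[i]:
            words[i] = best_word
            edits += 1
    return " ".join(words)


original = "We utilize models to demonstrate significant gains"
attacked = substitution_attack(original, toy_propose, toy_detector)
print(attacked)  # We use models to show notable gains
```

In this toy setting the detector score drops from 3/7 to 0 after three substitutions. A realistic attack would replace both the proposal function and the detector with actual models, and would need to check that the substitutions preserve meaning and fluency.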