The rapid development of large language models (LLMs) has yielded impressive results across a wide range of downstream tasks. However, the open-ended capabilities of LLMs also raise new security and privacy concerns when they are exploited for malicious purposes. For example, LLMs may be used to plagiarize or imitate writing, thereby infringing the copyright of the original content, or to indiscriminately generate fake information based on a given source text. In some cases, LLMs can even analyze text from the Internet to infer private personal information. Unfortunately, previous text protection research did not anticipate the emergence of powerful LLMs and is no longer effective in this new context. To bridge this gap, we introduce Silent Guardian (SG), a text protection mechanism against LLMs that causes an LLM to refuse to generate a response when it receives protected text, preventing malicious use of the text at its source. Specifically, we first propose the concept of Truncation Protection Examples (TPE). By carefully modifying the text to be protected, a TPE induces the LLM to sample the end token first, thus directly terminating the interaction. In addition, to efficiently construct TPE in the discrete space of text data, we propose a novel optimization algorithm called Super Tailored Protection (STP), which is not only highly efficient but also maintains the semantic consistency of the text during optimization. Comprehensive experimental evaluation demonstrates that SG can effectively protect the target text under various configurations, achieving a protection success rate of almost 100% in some cases. Notably, SG also exhibits relatively good transferability and robustness, making its application in practical scenarios feasible.
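To make the TPE idea concrete, the sketch below scores how likely a model is to emit the end token as its very first output token. It is a minimal illustration under stated assumptions, not the STP algorithm: the function names, the toy logit vectors, and the choice of negative log-probability as the objective are all hypothetical, and a real setup would take the first-step logits from an actual LLM given the protected text as its prompt.

```python
import math


def eos_first_token_loss(first_token_logits, eos_id):
    """Negative log-probability that the first sampled token is the end token.

    first_token_logits: hypothetical list of raw logits the model assigns to
    each vocabulary entry at the first generation step.
    eos_id: index of the end token in that list (an assumption of this sketch).
    """
    # Numerically stable log-sum-exp over the vocabulary.
    m = max(first_token_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in first_token_logits))
    return log_z - first_token_logits[eos_id]


def eos_first_token_prob(first_token_logits, eos_id):
    """Probability that the first sampled token is the end token."""
    return math.exp(-eos_first_token_loss(first_token_logits, eos_id))


def pick_best_candidate(candidate_logits, eos_id):
    """Return the index of the candidate text whose first-step logits give
    the lowest loss, i.e. the highest chance of immediate termination."""
    losses = [eos_first_token_loss(l, eos_id) for l in candidate_logits]
    return min(range(len(losses)), key=losses.__getitem__)


if __name__ == "__main__":
    # Toy first-step logits for three hypothetical rewrites of the same text,
    # over a 4-token vocabulary in which index 0 is the end token.
    candidates = [
        [0.0, 2.0, 1.0, 0.5],   # end token unlikely
        [3.0, 1.0, 0.0, 0.0],   # end token favoured
        [8.0, 0.0, 0.0, 0.0],   # end token strongly favoured
    ]
    best = pick_best_candidate(candidates, eos_id=0)
    print("best candidate:", best)
    print("P(end token first):", eos_first_token_prob(candidates[best], 0))
```

Under this framing, a protected text counts as successful when the end-token probability at the first step is close to 1, so that sampling terminates before any response is produced. How candidate modifications are proposed while preserving semantics is the job of the optimization algorithm and is not modeled here.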