The rapid development of large language models (LLMs) has yielded impressive results on a wide range of downstream tasks. However, because LLMs are open-ended, the same capabilities raise new security and privacy concerns when they are exploited for malicious purposes. For example, LLMs may be used to plagiarize or imitate writing, thereby infringing the copyright of the original content, or to generate fake information indiscriminately from a given source text. In some cases, LLMs can even analyze text from the Internet to infer private personal information. Unfortunately, prior text protection research did not anticipate the emergence of powerful LLMs and is no longer effective in this new context. To bridge this gap, we introduce Silent Guardian (SG), a text protection mechanism against LLMs that leads an LLM to refuse to generate a response when it receives protected text, preventing malicious use of the text at the source. Specifically, we first propose the concept of Truncation Protection Examples (TPE). By carefully modifying the text to be protected, a TPE induces the LLM to sample the end token first, which directly terminates the interaction. In addition, to construct TPE efficiently in the discrete space of text data, we propose a novel optimization algorithm called Super Tailored Protection (STP), which is highly efficient and also maintains the semantic consistency of the text during optimization. A comprehensive experimental evaluation demonstrates that SG can effectively protect the target text under various configurations, achieving a protection success rate of almost 100% in some cases. Notably, SG also exhibits relatively good transferability and robustness, which makes its application in practical scenarios feasible.
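To make the TPE objective concrete, the following is a minimal sketch under stated assumptions, not the STP algorithm itself. It assumes the objective can be written as the negative log-probability that the end token is the first token sampled, and it uses a naive greedy search over single-token substitutions. The names `eos_first_loss`, `greedy_step`, and `toy_next_logits`, as well as the candidate map (imagined here as semantically close replacements), are hypothetical and introduced only for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def eos_first_loss(logits: Sequence[float], eos_id: int) -> float:
    """Negative log-probability that the end token is sampled first.

    Computed as logsumexp(logits) - logits[eos_id], with the usual
    max-subtraction for numerical stability.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[eos_id]


def greedy_step(
    tokens: Sequence[int],
    candidates: Dict[int, List[int]],
    next_logits: Callable[[Sequence[int]], Sequence[float]],
    eos_id: int,
) -> Tuple[List[int], float]:
    """Try every single-token substitution and keep the lowest-loss one.

    `candidates` maps a position to its allowed replacement tokens
    (assumption: these were pre-filtered to preserve meaning).
    `next_logits` stands in for a real LLM's first-step logits.
    """
    best_tokens = list(tokens)
    best_loss = eos_first_loss(next_logits(best_tokens), eos_id)
    for pos, options in candidates.items():
        for tok in options:
            trial = list(tokens)
            trial[pos] = tok
            loss = eos_first_loss(next_logits(trial), eos_id)
            if loss < best_loss:
                best_tokens, best_loss = trial, loss
    return best_tokens, best_loss


def toy_next_logits(tokens: Sequence[int]) -> List[float]:
    """Toy stand-in model over a 4-token vocabulary (id 0 is the end token).

    The end-token logit grows with the number of occurrences of token 3.
    """
    return [float(list(tokens).count(3)), 0.0, 0.0, 0.0]


protected, loss = greedy_step([1, 2, 1], {1: [3]}, toy_next_logits, eos_id=0)
```

In this toy setup, replacing the middle token raises the end-token logit and therefore lowers the loss. A real construction would query an actual LLM for its logits and would need a far more efficient search than exhaustive substitution.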