With the help of simple fine-tuning, one can artificially embed hidden text into large language models (LLMs). This text is revealed only when the LLM is triggered by a specific query. Two primary applications are LLM fingerprinting and steganography. In the context of LLM fingerprinting, a unique text identifier (fingerprint) is embedded within the model to verify licensing compliance. In the context of steganography, the LLM serves as a carrier for hidden messages that can be disclosed through a chosen trigger question. Our work demonstrates that embedding hidden text in an LLM via fine-tuning, though seemingly secure due to the vast number of potential triggers (any sequence of characters or tokens could serve as a trigger), is susceptible to extraction through analysis of the LLM's output decoding process. We propose an extraction attack called Unconditional Token Forcing (UTF). It is premised on the hypothesis that iteratively feeding each token from the LLM's vocabulary into the model should reveal output sequences with abnormally high token probabilities, indicating potential hidden-text candidates. We also present a defense method, Unconditional Token Forcing Confusion (UTFC), which hides text in such a way that it is resistant both to UTF and to attacks based on sampling decoding methods. To the best of our knowledge, no existing attack method can extract text hidden with UTFC. UTFC has both benign applications (improving LLM fingerprinting) and malign applications (using LLMs to create covert communication channels). Code is available at github.com/j-hoscilowic/zurek-stegano
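The core UTF loop described above can be illustrated with a minimal toy sketch. Everything here is hypothetical: `next_token_probs` stands in for an LLM's next-token distribution (the real attack queries the model's actual decoding pipeline over its full vocabulary), and the vocabulary, hidden sequence, and `threshold` value are illustrative choices, not the paper's settings.

```python
# Toy sketch of Unconditional Token Forcing (UTF): force each vocabulary
# token as input, greedily decode, and flag continuations whose token
# probabilities are abnormally high (a potential hidden-text candidate).

VOCAB = ["a", "b", "secret", "msg", "end"]
HIDDEN = ["secret", "msg", "end"]  # hidden text planted by fine-tuning


def next_token_probs(context):
    """Stand-in for an LLM's next-token distribution.

    If the context ends inside the hidden sequence, the fine-tuned model
    continues it with near-certainty; otherwise probabilities are uniform.
    """
    last = context[-1]
    if last in HIDDEN[:-1]:
        nxt = HIDDEN[HIDDEN.index(last) + 1]
        probs = {t: 0.01 / (len(VOCAB) - 1) for t in VOCAB}
        probs[nxt] = 0.99
        return probs
    return {t: 1.0 / len(VOCAB) for t in VOCAB}


def utf_attack(threshold=0.9, max_len=10):
    """Return sequences whose greedy continuations all exceed `threshold`."""
    candidates = []
    for tok in VOCAB:  # iterate over every vocabulary token as the trigger
        seq = [tok]
        while len(seq) < max_len:
            probs = next_token_probs(seq)
            best = max(probs, key=probs.get)
            if probs[best] < threshold:
                break  # continuation no longer abnormally confident
            seq.append(best)
        if len(seq) > 1:  # at least one high-confidence continuation
            candidates.append(seq)
    return candidates


if __name__ == "__main__":
    print(utf_attack())  # the planted hidden text appears among candidates
```

Note that forcing a token from the middle of the hidden sequence also surfaces its suffix, so the attack typically yields a small candidate set that includes the full hidden text rather than a single exact match.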