With the help of simple fine-tuning, one can artificially embed hidden text into large language models (LLMs). This text is revealed only when the LLM is triggered by a specific query. Two primary applications are LLM fingerprinting and steganography. In the context of LLM fingerprinting, a unique text identifier (fingerprint) is embedded within the model to verify licensing compliance. In the context of steganography, the LLM serves as a carrier for hidden messages that can be disclosed through a chosen trigger question. Our work demonstrates that embedding hidden text in an LLM via fine-tuning, though seemingly secure due to the vast number of potential triggers (any sequence of characters or tokens could serve as a trigger), is susceptible to extraction through analysis of the LLM's output decoding process. We propose an extraction attack called Unconditional Token Forcing (UTF). It is premised on the hypothesis that iteratively feeding each token from the LLM's vocabulary into the model should reveal output sequences with abnormally high token probabilities, indicating potential hidden-text candidates. We also present a defense method, Unconditional Token Forcing Confusion (UTFC), which hides text in such a way that it is resistant both to UTF and to attacks based on sampling decoding methods. To the best of our knowledge, no existing attack method can extract text hidden with UTFC. UTFC has both benign applications (improving LLM fingerprinting) and malign applications (using LLMs to create covert communication channels). Code is available at github.com/j-hoscilowic/zurek-stegano
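The core UTF loop described above can be illustrated with a minimal toy sketch. Everything here is hypothetical: `next_token_probs` stands in for an LLM's next-token distribution (the real attack queries the model's actual decoding pipeline over its full vocabulary), and the vocabulary, hidden sequence, and `threshold` value are illustrative choices, not the paper's settings.

```python
# Toy sketch of Unconditional Token Forcing (UTF): force each vocabulary
# token as input, greedily decode, and flag continuations whose token
# probabilities are abnormally high (a potential hidden-text candidate).

VOCAB = ["a", "b", "secret", "msg", "end"]
HIDDEN = ["secret", "msg", "end"]  # hidden text planted by fine-tuning


def next_token_probs(context):
    """Stand-in for an LLM's next-token distribution.

    If the context ends inside the hidden sequence, the fine-tuned model
    continues it with near-certainty; otherwise probabilities are uniform.
    """
    last = context[-1]
    if last in HIDDEN[:-1]:
        nxt = HIDDEN[HIDDEN.index(last) + 1]
        probs = {t: 0.01 / (len(VOCAB) - 1) for t in VOCAB}
        probs[nxt] = 0.99
        return probs
    return {t: 1.0 / len(VOCAB) for t in VOCAB}


def utf_attack(threshold=0.9, max_len=10):
    """Return sequences whose greedy continuations all exceed `threshold`."""
    candidates = []
    for tok in VOCAB:  # iterate over every vocabulary token as the trigger
        seq = [tok]
        while len(seq) < max_len:
            probs = next_token_probs(seq)
            best = max(probs, key=probs.get)
            if probs[best] < threshold:
                break  # continuation no longer abnormally confident
            seq.append(best)
        if len(seq) > 1:  # at least one high-confidence continuation
            candidates.append(seq)
    return candidates


if __name__ == "__main__":
    print(utf_attack())  # the planted hidden text appears among candidates
```

Note that forcing a token from the middle of the hidden sequence also surfaces its suffix, so the attack typically yields a small candidate set that includes the full hidden text rather than a single exact match.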