Large Language Model (LLM) services and models often come with legal rules governing who may use them and how. Assessing the compliance of released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV): determining whether a third-party application uses a specific LLM through its chat function. We propose Targeted Random Adversarial Prompt (TRAP), a method that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, so that the target LLM returns a pre-defined answer while other models answer randomly. TRAP detects the target LLM with over 95% true positive rate at under 0.2% false positive rate after even a single interaction, and remains effective when the LLM undergoes minor changes that do not significantly alter its original function.
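The verification step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the prompt wording, the target answer "314", the `query_model` interface, and the toy models are all hypothetical stand-ins, not the paper's exact artifacts, and the adversarial suffix itself would be produced by a separate optimization procedure.

```python
import random

# Hypothetical stand-in for querying the third-party application's chat
# function; in practice this would call the deployed service's endpoint.
def query_model(prompt: str, model) -> str:
    return model(prompt)

def is_target_llm(model, trap_prompt: str, target_answer: str) -> bool:
    """BBIV check: the TRAP prompt (an instruction plus an adversarial
    suffix) was optimized so that only the target LLM emits
    `target_answer`; other models answer (near-)randomly, so a match is
    strong evidence of identity after a single interaction."""
    reply = query_model(trap_prompt, model)
    return reply.strip() == target_answer

# Illustrative models (assumptions): the "target" reproduces the
# optimized answer, while an unrelated model returns a random 3-digit
# string and matches only with probability 1/1000, which is what keeps
# the false positive rate low.
target_model = lambda p: "314"
other_model = lambda p: f"{random.randrange(1000):03d}"

trap_prompt = "Write a random string of three digits. <adversarial suffix>"
print(is_target_llm(target_model, trap_prompt, "314"))
print(is_target_llm(other_model, trap_prompt, "314"))
```

The design choice of asking for a *random* short string matters: a non-target model has no reason to prefer the pre-defined answer, so its accidental match rate stays near chance.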