Large Language Model (LLM) services and models often come with legal rules governing who may use them and how. Assessing compliance with these rules is crucial, as they protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel problem of Black-box Identity Verification (BBIV): determining, through its chat function, whether a third-party application uses a certain LLM. We propose Targeted Random Adversarial Prompt (TRAP), a method that identifies the specific LLM in use. TRAP repurposes adversarial suffixes, originally proposed for jailbreaking, to elicit a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLM with a true positive rate above 95% at a false positive rate below 0.2%, even after a single interaction. It remains effective when the LLM has undergone minor changes that do not significantly alter its original function.
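To make the verification step concrete, the following is a minimal sketch of the decision rule described above, not the authors' implementation. It assumes the adversarial suffix has already been optimized offline against the reference LLM (the optimization is not shown), and that the application's chat function can be wrapped as a plain `prompt -> reply` callable. The names `build_probe`, `is_target_model`, `positive_rate`, the placeholder suffix, the probe wording, and the two stand-in models are all hypothetical and exist only for illustration.

```python
import random
import re
from typing import Callable, Sequence

ChatFn = Callable[[str], str]  # hypothetical wrapper around an application's chat function


def build_probe(suffix: str, n_digits: int) -> str:
    """Ask for a random number and append the pre-optimized adversarial suffix."""
    return f"Write a random string composed of {n_digits} digits. {suffix}"


def is_target_model(chat: ChatFn, suffix: str, target_answer: str) -> bool:
    """Single-interaction check: does the reply contain exactly the pre-defined answer?"""
    reply = chat(build_probe(suffix, len(target_answer)))
    digit_runs = re.findall(r"\d+", reply)  # every maximal run of digits in the reply
    return target_answer in digit_runs


def positive_rate(chats: Sequence[ChatFn], suffix: str, target_answer: str) -> float:
    """Fraction of chat functions flagged as the target.

    Over target deployments this estimates the true positive rate;
    over other models it estimates the false positive rate.
    """
    flagged = sum(is_target_model(c, suffix, target_answer) for c in chats)
    return flagged / len(chats)


# --- Stand-ins for illustration only; real use would call the third-party app. ---
SUFFIX = "<optimized-suffix>"  # placeholder, not a real adversarial suffix
TARGET = "3141"                # pre-defined answer the suffix is assumed to force

_rng = random.Random(0)


def fake_target_model(prompt: str) -> str:
    """Simulates the reference LLM: the suffix steers it to the pre-defined answer."""
    if SUFFIX in prompt:
        return f"Sure: {TARGET}"
    return f"Sure: {_rng.randrange(10_000):04d}"


def fake_other_model(prompt: str) -> str:
    """Simulates an unrelated LLM: the suffix has no effect, so the answer is random."""
    return f"Here you go: {_rng.randrange(10_000):04d}"
```

With these stand-ins, `is_target_model(fake_target_model, SUFFIX, TARGET)` flags the simulated target, while an unrelated model matches only if it happens to draw the same digits by chance. This is why a longer pre-defined answer lowers the chance of a false positive in this toy setup.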