It has been widely observed that language models (LMs) respond in predictable ways to algorithmically generated prompts that are seemingly unintelligible. This is both a sign that we lack a full understanding of how LMs work and a practical challenge, because this opacity can be exploited for harmful uses of LMs, such as jailbreaking. We present the first thorough analysis of opaque machine-generated prompts, or autoprompts, pertaining to 6 LMs of different sizes and families. We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation. A small but consistent proportion of the preceding tokens are prunable, probably appearing in the prompt as a by-product of the fact that the optimization process fixes the number of tokens. The remaining tokens fall into two categories: filler tokens, which can be replaced with semantically unrelated substitutes, and keywords, which tend to have at least a loose semantic relation with the generation, although they do not engage in well-formed syntactic relations with it. Additionally, human experts can reliably identify the most influential tokens in an autoprompt a posteriori, suggesting these prompts are not entirely opaque. Finally, some of the ablations we applied to autoprompts yield similar effects in natural language inputs, suggesting that autoprompts emerge naturally from the way LMs process linguistic inputs in general.