It has been widely observed that language models (LMs) respond in predictable ways to algorithmically generated prompts that are seemingly unintelligible. This is both a sign that we lack a full understanding of how LMs work and a practical challenge, because such opaqueness can be exploited for harmful uses of LMs, such as jailbreaking. We present the first thorough analysis of opaque machine-generated prompts, or autoprompts, spanning 3 LMs of different sizes and families. We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation. A small but consistent proportion of the preceding tokens are fillers, which likely appear in the prompt as a by-product of the fact that the optimization process fixes the number of tokens. The remaining tokens tend to bear at least a loose semantic relation to the generation, although they do not enter into well-formed syntactic relations with it. Moreover, we find that some of the ablations we applied to machine-generated prompts can also be applied to natural language sequences, with similar effects, suggesting that autoprompts are a direct consequence of the way in which LMs process linguistic inputs in general.
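To make concrete why a fixed token budget can leave filler tokens in an optimized prompt, here is a toy sketch of a fixed-length discrete prompt search. The scoring stub, vocabulary, and hill-climbing loop are all illustrative assumptions, not the paper's method: real autoprompt optimizers (e.g. gradient-guided token-swap methods) score candidates with an actual LM's loss on a target continuation.

```python
import random

# Toy stand-in for an LM's log-probability of a target continuation given a
# prompt: it simply rewards prompts containing tokens from a "target" set.
# Both VOCAB and TARGET are hypothetical, chosen for illustration only.
VOCAB = ["cat", "dog", "runs", "the", "xq", "##", "sat", "mat"]
TARGET = {"cat", "sat", "mat"}

def score(prompt):
    return sum(1 for tok in prompt if tok in TARGET)

def autoprompt_search(length=5, iters=200, seed=0):
    """Greedy fixed-length search: propose single-token swaps, keep any
    candidate that does not lower the score.

    The prompt length is fixed up front, mirroring how discrete prompt
    optimizers keep a constant token budget. Positions the search never
    manages to improve retain arbitrary "filler" tokens in the final prompt.
    """
    rng = random.Random(seed)
    prompt = [rng.choice(VOCAB) for _ in range(length)]
    best = score(prompt)
    for _ in range(iters):
        pos = rng.randrange(length)
        cand = prompt[:]
        cand[pos] = rng.choice(VOCAB)
        s = score(cand)
        if s >= best:  # accept ties so the search keeps exploring
            prompt, best = cand, s
    return prompt, best
```

Because accepted swaps never lower the score, the returned score is monotone in the number of iterations for a fixed seed; positions whose swaps never help end up holding whatever token the initialization or a tied swap left there, which is one way fillers arise.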