Language models produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of language model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $78$ and recovers $27\%$ of prompts exactly. Code for reproducing all experiments is available at http://github.com/jxmorris12/vec2text.
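To make the reported metrics concrete, the following is a minimal sketch of how a token-level F1 and an exact-match check could be computed between a reconstructed prompt and the true prompt. It assumes F1 is defined as multiset overlap between the two token sequences and uses whitespace splitting as a stand-in tokenizer. The abstract does not specify the exact definition or tokenizer used in the experiments, so the function names `token_f1` and `exact_match` and these choices are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Multiset-overlap F1 between two token sequences, on a 0-100 scale.

    Assumption: tokens are compared as a bag (order is ignored), which is
    one common definition of token-level F1.
    """
    if not predicted or not reference:
        # Convention assumed here: two empty sequences count as a perfect match.
        return 100.0 if predicted == reference else 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 100.0 * 2 * precision * recall / (precision + recall)


def exact_match(predicted, reference):
    """True only when the reconstruction reproduces every token in order."""
    return predicted == reference


# Hypothetical example pair; whitespace splitting stands in for a real tokenizer.
true_prompt = "translate the following text to French".split()
recovered = "translate the following sentence to French".split()

score = token_f1(recovered, true_prompt)
is_exact = exact_match(recovered, true_prompt)
```

In this example five of the six tokens overlap, so precision and recall are both 5/6 and the reconstruction scores well on F1 while failing exact match. This shows how a method can report a high token-level F1 alongside a much lower exact-recovery rate.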