Despite a growing body of research on language models (LMs), few approaches analyse their invertibility. That is, given an LM and a desired target output sequence of tokens, determining which input prompts would yield the target output remains an open problem. We formulate this problem as classical gradient-based optimisation. First, we propose a simple algorithm that makes a given (frozen) LM end-to-end differentiable, and then finds optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens), yielding a differentiable LM (DLM). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
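The core relaxation described above can be illustrated with a minimal sketch. This is not the paper's implementation: it stands in a one-layer linear "LM" (matrices `E`, `W`), a single-token target, and hand-derived gradients, all of which are illustrative assumptions. The prompt is held as a matrix of logits; each row is softmaxed into a distribution over the vocabulary, mixed into a "soft" embedding, pushed through the frozen model, and updated by plain gradient descent on the target's cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names and shapes are illustrative, not from the paper):
V, d, T = 12, 8, 4           # vocab size, embedding dim, prompt length
E = rng.normal(size=(V, d))  # frozen token-embedding matrix
W = rng.normal(size=(d, V))  # frozen head of a one-layer stand-in "LM"
target = 3                   # desired next token id

def forward(P):
    """Relaxed prompt P (T x V logits) -> (loss, token distributions, output probs)."""
    D = np.exp(P - P.max(axis=1, keepdims=True))
    D /= D.sum(axis=1, keepdims=True)   # each row is a distribution over tokens
    X = D @ E                           # "soft" embeddings (convex mix of rows of E)
    z = X.mean(axis=0) @ W              # next-token logits of the toy LM
    p = np.exp(z - z.max()); p /= p.sum()
    return -np.log(p[target]), D, p

def grad(P):
    loss, D, p = forward(P)
    dz = p.copy(); dz[target] -= 1.0    # d loss / d z for softmax cross-entropy
    dX = np.tile((W @ dz) / T, (T, 1))  # back through the mean and the head
    dD = dX @ E.T                       # back through the soft embedding lookup
    dP = D * (dD - (dD * D).sum(axis=1, keepdims=True))  # softmax backward, row-wise
    return loss, dP

P = rng.normal(size=(T, V))             # prompt logits being optimised
losses = []
for _ in range(200):
    loss, dP = grad(P)
    losses.append(loss)
    P -= 0.2 * dP                       # plain gradient descent on the prompt

# The relaxed prompt can be discretised by taking the argmax per position.
prompt_tokens = P.argmax(axis=1)
```

Because the prompt lives in distribution space rather than token space, the whole pipeline is differentiable and ordinary first-order optimisers apply; discretising back to hard tokens (here by per-position argmax) is a separate step.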