Optimization methods are widely employed in deep learning to identify and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed the \emph{functional homotopy} method, which leverages the functional duality between model training and input generation. The method constructs a series of easy-to-hard optimization problems and solves them iteratively, drawing on principles from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a $20\%$--$30\%$ improvement in success rate over existing methods in circumventing safety-aligned open-source models such as Llama-2 and Llama-3.
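As a rough illustration of the homotopy idea only (the family $H$, the linear schedule, and the symbols $f_0$, $f_1$ are illustrative assumptions, not the paper's exact construction), a continuation scheme replaces a hard objective with a parameterized family that starts easy and is gradually deformed into the target problem, with each intermediate optimum warm-starting the next, slightly harder instance:
\[
  H(x, t) \;=\; (1 - t)\, f_0(x) \;+\; t\, f_1(x), \qquad 0 = t_0 < t_1 < \dots < t_k = 1,
\]
\[
  x^{*}_{t_{i+1}} \;\approx\; \arg\min_{x}\, H(x, t_{i+1}), \quad \text{initialized at } x^{*}_{t_i},
\]
where $f_1$ denotes the hard target objective and $f_0$ an easier surrogate, so the original problem is recovered at $t = 1$.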