Numerous capability and safety techniques for Large Language Models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be cast as sampling from an unnormalized target distribution defined by a given reward or potential function over the full sequence. In this work, we leverage the rich toolkit of Sequential Monte Carlo (SMC) for these probabilistic inference problems. In particular, we use learned twist functions to estimate the expected future value of the potential at each timestep, which enables us to focus inference-time computation on promising partial sequences. We propose a novel contrastive method for learning the twist functions, and establish connections with the literature on soft reinforcement learning. As a complementary application of our twisted SMC framework, we present methods for evaluating the accuracy of language model inference techniques using novel bidirectional SMC bounds on the log partition function. These bounds can be used to estimate the KL divergence between the inference and target distributions in both directions. We apply our inference evaluation techniques to show that twisted SMC is effective for sampling undesirable outputs from a pretrained model (a useful component of harmlessness training and automated red-teaming), generating reviews with varied sentiment, and performing infilling tasks.
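The core mechanics described above can be illustrated with a minimal sketch of twisted SMC on a toy problem. The sketch below is illustrative, not the paper's implementation: it assumes a hypothetical context-free "language model" over a 3-token vocabulary, a hypothetical terminal potential `log_phi`, and twist functions computed exactly by brute-force enumeration (in the paper these are learned networks approximating the expected future potential, since the exact expectation is intractable). With exact twists, the proposal p(x | prefix)·ψ_t(prefix + x) makes every incremental importance weight constant, so the SMC estimate of log Z matches exact enumeration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

V, T = 3, 3                        # toy vocabulary size and sequence length
# hypothetical context-free "language model": a fixed next-token distribution
p_next = np.array([0.5, 0.3, 0.2])

def log_phi(seq):
    # hypothetical terminal potential over the full sequence (prefers large tokens)
    return 1.5 * sum(seq)

def twist(seq, t):
    # exact twist psi_t(x_{1:t}) = E_p[ exp(log_phi(x_{1:T})) | x_{1:t} ],
    # computed by enumeration; in the paper this is a learned function
    if t == T:
        return np.exp(log_phi(seq))
    return sum(p_next[x] * twist(seq + [x], t + 1) for x in range(V))

def twisted_smc(K=256):
    """Estimate log Z = log E_p[exp(log_phi)] with K twisted SMC particles."""
    particles = [[] for _ in range(K)]
    log_w = np.zeros(K)            # running log importance weights
    logZ = 0.0
    for t in range(1, T + 1):
        for i in range(K):
            prefix = particles[i]
            # proposal proportional to p(x | prefix) * psi_t(prefix + [x])
            unnorm = np.array([p_next[x] * twist(prefix + [x], t)
                               for x in range(V)])
            norm = unnorm.sum()
            x = rng.choice(V, p=unnorm / norm)
            # incremental weight: proposal normalizer / previous twist (psi_0 = 1)
            log_prev = np.log(twist(prefix, t - 1)) if t > 1 else 0.0
            log_w[i] += np.log(norm) - log_prev
            particles[i] = prefix + [x]
        # fold the mean weight into the log Z estimate, then resample
        logZ += np.log(np.mean(np.exp(log_w)))
        probs = np.exp(log_w - log_w.max())
        probs /= probs.sum()
        idx = rng.choice(K, size=K, p=probs)
        particles = [list(particles[j]) for j in idx]
        log_w = np.zeros(K)
    return logZ

# exact log Z by brute-force enumeration of all V**T sequences
Z = sum(np.prod([p_next[x] for x in s]) * np.exp(log_phi(list(s)))
        for s in product(range(V), repeat=T))
print(twisted_smc(), np.log(Z))
```

Because the twists here are exact conditional expectations of the potential, the incremental weights carry no variance and the estimate coincides with the enumerated log Z; with learned (approximate) twists, the gap between such SMC estimates from the two directions is what the paper's bidirectional bounds exploit to measure inference quality.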