We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep. These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple images, such as correspondence and appearance similarity. Furthermore, by comparing the readout estimates to a user-defined target, and back-propagating the gradient through the readout head, these estimates can be used to guide the sampling process. Compared to prior methods for conditional generation, Readout Guidance requires significantly fewer added parameters and training samples, and offers a convenient and simple recipe for reproducing different forms of conditional control under a single framework, with a single architecture and sampling procedure. We showcase these benefits in the applications of drag-based manipulation, identity-consistent generation, and spatially aligned control. Project page: https://readout-guidance.github.io.
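The guidance step described above (compare the readout estimate to a user-defined target, then push the gradient back through the readout head) can be illustrated with a toy sketch. This is not the authors' implementation. The names `FEATURE_MAP`, `HEAD`, `readout`, `guidance_grad`, and `guided_sample` are hypothetical. Fixed linear maps stand in for the frozen diffusion features and the learned readout head, so the gradient can be written by hand rather than computed with autodiff, and the squared-error loss and step size are assumptions chosen for illustration.

```python
# Toy sketch of readout-style guidance. Everything here is a hypothetical
# stand-in: linear maps replace the frozen diffusion features and the readout
# head, so the gradient can be derived by hand (no autodiff library needed).

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

# Frozen "diffusion features": a fixed linear map of the latent (assumption).
FEATURE_MAP = [[1.0, 0.0, 0.5],
               [0.0, 1.0, -0.5]]
# "Readout head": a small linear map from features to a 1-D readout (assumption).
HEAD = [[0.8, 0.3]]

def readout(latent):
    """Readout estimate for the current latent."""
    return matvec(HEAD, matvec(FEATURE_MAP, latent))

def guidance_grad(latent, target):
    """Gradient of 0.5 * ||readout(latent) - target||^2 w.r.t. the latent."""
    residual = [r - t for r, t in zip(readout(latent), target)]
    # Chain rule: back through the head first, then through the frozen features.
    grad_features = matvec(transpose(HEAD), residual)
    return matvec(transpose(FEATURE_MAP), grad_features)

def guided_sample(latent, target, steps=50, step_size=0.2):
    """Apply a guidance update at every step of a toy sampling loop."""
    for _ in range(steps):
        # A real sampler would also apply the model's denoising update here;
        # it is omitted so that only the guidance term is visible.
        grad = guidance_grad(latent, target)
        latent = [x - step_size * g for x, g in zip(latent, grad)]
    return latent

start = [1.0, -2.0, 0.5]
target = [2.0]
final = guided_sample(start, target)
print(readout(start), readout(final))
```

In this sketch only the latent is updated: the feature map and the head stay fixed throughout, mirroring the frozen backbone and the pre-trained lightweight head described in the abstract.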