Diffusion models have demonstrated remarkable and robust abilities in both image and video generation. To achieve greater control over generated results, researchers have introduced additional architectures, such as ControlNet, Adapters, and ReferenceNet, to integrate conditioning controls. However, current controllable generation methods often demand substantial extra computational resources, especially for video generation, and are either difficult to train or offer only weak control. In this paper, we propose ControlNeXt: a powerful and efficient method for controllable image and video generation. We first design a simpler, more efficient architecture that replaces the heavy auxiliary branches of prior methods at minimal additional cost over the base model. This concise structure also allows our method to integrate seamlessly with other LoRA weights, enabling style alteration without additional training. For training, we reduce the number of learnable parameters by up to 90% compared to the alternatives. Furthermore, we propose Cross Normalization (CN) as a replacement for Zero-Convolution, achieving fast and stable training convergence. Experiments with different base models across images and videos demonstrate the robustness of our method.
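As a rough illustration of the Cross Normalization idea, the sketch below normalizes the control branch's features to the per-channel statistics of the base model's features before fusing them. The function name, tensor shapes, and the additive fusion step are assumptions for illustration only, since the abstract does not spell out the exact mechanism.

```python
import torch

def cross_normalization(main_feat: torch.Tensor,
                        control_feat: torch.Tensor,
                        eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical sketch: align the control features' per-channel
    distribution with that of the base model's features, so the
    injected condition does not destabilize early training the way
    an unaligned branch would. Shapes assumed to be (B, C, H, W)."""
    dims = (0, 2, 3)  # statistics over batch and spatial dimensions
    # Standardize the control features...
    c_mu = control_feat.mean(dim=dims, keepdim=True)
    c_sigma = control_feat.std(dim=dims, keepdim=True)
    normalized = (control_feat - c_mu) / (c_sigma + eps)
    # ...then rescale them to the main branch's statistics.
    m_mu = main_feat.mean(dim=dims, keepdim=True)
    m_sigma = main_feat.std(dim=dims, keepdim=True)
    return normalized * m_sigma + m_mu

# Illustrative fusion inside a denoising block (additive fusion assumed):
# fused = main_feat + cross_normalization(main_feat, control_feat)
```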