Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods require substantial training time and GPU memory to learn the sliders or embeddings, and they must be retrained for each diffusion backbone, which limits their scalability and adaptability. To address these limitations, we introduce Text Slider, a lightweight, efficient, and plug-and-play framework that identifies low-rank directions within a pre-trained text encoder, enabling continuous control of visual concepts while significantly reducing training time, GPU memory consumption, and the number of trainable parameters. Text Slider also supports multi-concept composition, enabling fine-grained and flexible manipulation in both image and video synthesis. We show that Text Slider smoothly and continuously modulates specific attributes while preserving the original spatial layout and structure of the input. It is also markedly more efficient: training is 5$\times$ faster than Concept Slider and 47$\times$ faster than Attribute Control, with GPU memory usage reduced by nearly 2$\times$ and 4$\times$, respectively.
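To make the idea of a low-rank direction with a continuous strength more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say which text-encoder layers are adapted or how the factors are learned, so the sketch simply assumes a LoRA-style update of the form W + scale * B A on a single frozen linear layer. All names (`matvec`, `slider_linear`, `W`, `A`, `B`, `scale`) are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: a low-rank "slider" applied to one linear layer.
# Assumption: the slider is a LoRA-style update W + scale * (B @ A); the
# abstract does not confirm this exact form or which layers it targets.

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def slider_linear(W, B, A, x, scale):
    """Compute (W + scale * B @ A) @ x without forming the full update.

    W:     d_out x d_in frozen weight (stays untouched)
    A:     r x d_in low-rank factor, with r much smaller than d_in
    B:     d_out x r low-rank factor
    scale: continuous slider strength; 0 recovers the frozen layer
    """
    base = matvec(W, x)              # output of the frozen layer
    delta = matvec(B, matvec(A, x))  # low-rank direction applied to x
    return [b + scale * d for b, d in zip(base, delta)]


# Toy values: a 2x2 identity weight and a rank-1 direction.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]       # 1 x 2
B = [[0.5], [-0.5]]    # 2 x 1
x = [2.0, 4.0]

# Sweeping the scale moves the output along a single direction.
for scale in (-1.0, 0.0, 1.0):
    print(scale, slider_linear(W, B, A, x, scale))
```

Because only `A` and `B` would be trainable in such a setup, the parameter count grows with the rank r rather than with the full weight size, which is consistent with the reduced training cost the abstract reports. Under the same assumption, composing several concepts would amount to summing several such low-rank terms, each with its own scale.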