Machine autonomy and human control are often divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in reconciling these goals, especially when prompted with arbitrary language. However, they often fall short when asked to generate images under spatial, structural, or geometric controls. Integrating such controls so that a single unified model can accommodate various visual conditions remains an open challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a single framework while still allowing arbitrary language prompts. UniControl enables pixel-level-precise image generation, in which visual conditions primarily determine the generated structures and language prompts guide the style and context. To equip UniControl to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet that modulates the diffusion models, enabling adaptation to different C2I tasks simultaneously. Trained on nine distinct C2I tasks, UniControl demonstrates strong zero-shot generation abilities on unseen visual conditions. Experimental results show that UniControl often surpasses single-task-controlled methods of comparable model size. This versatility of control positions UniControl as a significant advance in controllable visual generation.
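The idea of a task-aware HyperNet modulating a shared model can be pictured with a toy sketch. The abstract does not specify the modulation mechanism, so everything here is an assumption made for illustration: the task names, the per-channel scaling, and the helper names `make_hypernet`, `task_scales`, and `modulate` are hypothetical and are not UniControl's actual architecture or API.

```python
import random

# Hypothetical task names, chosen only for illustration.
TASKS = ["canny", "depth", "segmentation"]


def make_hypernet(num_tasks, num_channels, seed=0):
    """Toy 'hypernet': one row of small per-channel offsets per task.

    Assumption: a real hypernet would be a learned network; here the
    weights are random and fixed by the seed.
    """
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.1, 0.1) for _ in range(num_channels)]
        for _ in range(num_tasks)
    ]


def task_scales(hypernet, task_index):
    """Return per-channel scales for one task.

    Multiplying a one-hot task vector by the weight matrix selects a row;
    adding 1.0 keeps the scales close to the identity.
    """
    return [1.0 + w for w in hypernet[task_index]]


def modulate(features, scales):
    """Scale each channel of `features` (channels x positions) by its factor."""
    return [
        [scale * value for value in channel]
        for channel, scale in zip(features, scales)
    ]


# Shared features standing in for one layer's activations (4 channels).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
hypernet = make_hypernet(len(TASKS), num_channels=4)

# The same features are modulated differently depending on the task.
depth_out = modulate(features, task_scales(hypernet, TASKS.index("depth")))
canny_out = modulate(features, task_scales(hypernet, TASKS.index("canny")))
```

The point of the sketch is only that one set of shared weights can serve several tasks when a small task-conditioned module adjusts its activations, which is the general pattern the abstract describes.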