Machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary natural language. However, they often fall short in generating images under spatial, structural, or geometric controls. Integrating such controls so that a single unified model can accommodate various visual conditions remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a single framework while still allowing arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet that modulates the diffusion models, enabling simultaneous adaptation to different C2I tasks. Trained on nine distinct C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses single-task-controlled methods of comparable model size. This control versatility positions UniControl as a significant advance in controllable visual generation.
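To make the idea of task-aware modulation concrete, the following is a minimal sketch of how a hypernetwork might map a task embedding to per-channel scales that modulate shared features. It is an illustration under stated assumptions, not UniControl's implementation: the function names (`task_scales`, `modulate`, `condition_features`), the toy embeddings in `TASK_EMBEDDINGS`, the linear hypernetwork, and the `(1 + scale)` modulation rule are all hypothetical choices made for this example.

```python
# Illustrative sketch only: names, embeddings, and the modulation rule are
# assumptions for this example, not taken from the UniControl method.

# Toy task embeddings (assumed). A real system would learn or encode these.
TASK_EMBEDDINGS = {
    "edge": [1.0, 0.0, 0.0],
    "depth": [0.0, 1.0, 0.0],
}


def task_scales(weights, task_embedding):
    """Hypothetical linear hypernetwork: one scale per feature channel.

    weights has one row per channel; each row has one entry per
    embedding dimension.
    """
    return [sum(w * e for w, e in zip(row, task_embedding)) for row in weights]


def modulate(features, scales):
    """Scale every value in channel c by (1 + scales[c]).

    A zero scale leaves the channel unchanged, so the shared features
    pass through untouched when the hypernetwork outputs zeros.
    """
    return [[(1.0 + s) * v for v in channel]
            for channel, s in zip(features, scales)]


def condition_features(features, task, weights):
    """Modulate shared features according to the selected task."""
    scales = task_scales(weights, TASK_EMBEDDINGS[task])
    return modulate(features, scales)


if __name__ == "__main__":
    # Two channels, two values each; weights chosen by hand for clarity.
    feats = [[1.0, 2.0], [3.0, 4.0]]
    weights = [[0.5, 0.0, 0.0], [0.0, -0.5, 0.0]]
    print(condition_features(feats, "edge", weights))
    print(condition_features(feats, "depth", weights))
```

In this toy setup the same shared features are rescaled differently depending on the task key, which is the general pattern the abstract describes: one backbone, with a small task-conditioned module steering its behavior.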