The recently introduced ControlNet can steer the text-driven image generation process with geometric input such as 2D human pose or edge features. While ControlNet provides control over the geometric form of the instances in the generated image, it cannot dictate the visual appearance of each instance. We present FineControlNet, which provides fine control over each instance's appearance while maintaining precise pose control. Specifically, we develop and demonstrate FineControlNet with geometric control via human pose images and appearance control via instance-level text prompts. The spatial alignment of instance-specific text prompts and 2D poses in latent space enables FineControlNet's fine control capabilities. We evaluate FineControlNet through rigorous comparison against state-of-the-art pose-conditioned text-to-image diffusion models. FineControlNet outperforms existing methods at generating images that follow the user-provided instance-specific text prompts and poses. Project webpage: https://samsunglabs.github.io/FineControlNet-project-page
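The abstract does not describe how instance-specific prompts and poses are spatially aligned in latent space, so the following is only an illustrative sketch of one way such alignment could work, not the authors' implementation. It assumes each instance has its own latent (for example, one denoised under that instance's text prompt), builds a rectangular mask per instance from its 2D pose keypoints at latent resolution, and blends the latents with mask-normalized weights. The function names `instance_masks` and `blend_latents`, the bounding-box masks, the padding value, and the 64x64 latent grid for a 512x512 image are all assumptions made for illustration.

```python
import numpy as np


def instance_masks(keypoints_per_instance, image_hw, latent_hw, pad=1):
    """Build one binary mask per instance at latent resolution.

    Hypothetical helper: each mask is the padded bounding box of the
    instance's 2D keypoints, given as (x, y) pixel coordinates.
    Returns (masks, background) with shapes (N, h, w) and (h, w).
    """
    H, W = image_hw
    h, w = latent_hw
    masks = []
    for kps in keypoints_per_instance:
        kps = np.asarray(kps, dtype=float)  # shape (K, 2)
        # Map pixel coordinates onto the latent grid and pad the box.
        x0 = int(np.floor(kps[:, 0].min() / W * w)) - pad
        x1 = int(np.ceil(kps[:, 0].max() / W * w)) + pad
        y0 = int(np.floor(kps[:, 1].min() / H * h)) - pad
        y1 = int(np.ceil(kps[:, 1].max() / H * h)) + pad
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, w), min(y1, h)
        mask = np.zeros((h, w))
        mask[y0:y1, x0:x1] = 1.0
        masks.append(mask)
    masks = np.stack(masks)
    # Background is every latent cell not covered by any instance.
    background = (masks.sum(axis=0) == 0).astype(float)
    return masks, background


def blend_latents(instance_latents, background_latent, masks, background):
    """Mask-weighted average of per-instance latents.

    instance_latents: (N, C, h, w); background_latent: (C, h, w).
    Overlapping instances share a cell equally (an assumption).
    """
    weights = np.concatenate([masks, background[None]], axis=0)  # (N+1, h, w)
    # Every cell is covered by an instance or the background, so the
    # per-cell weight sum is at least 1 and the division is safe.
    weights = weights / weights.sum(axis=0, keepdims=True)
    latents = np.concatenate(
        [instance_latents, background_latent[None]], axis=0
    )  # (N+1, C, h, w)
    return (weights[:, None] * latents).sum(axis=0)  # (C, h, w)
```

In this sketch each latent cell takes its value from the instance whose pose covers it, which is the sense in which a text prompt stays tied to its own pose. A real system would likely use softer masks derived from the pose skeleton and would apply the blending inside the diffusion sampling loop rather than once on final latents.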