This paper presents an approach to improving image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). We propose a framework in which LLaVA analyzes an input image and generates a textual description of it, referred to below as an LLaVA-generated prompt. This prompt is fed, together with the original image, into the image-to-image generation pipeline. The enriched representation guides generation towards outputs that more closely resemble the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in promoting image similarity: compared to traditional methods, we observe a significant improvement in visual coherence between the generated and input images. Future work will explore fine-tuning the LLaVA-generated prompts for greater control over the creative process. By including more specific details in the prompts, we aim to balance faithfulness to the original image against artistic expression in the generated outputs.
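The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `caption_guided_img2img`, the `Captioner` and `Generator` callable types, `GenerationResult`, and the `strength` parameter are all hypothetical names introduced here for illustration, and no real LLaVA or diffusion-library API is assumed. The captioner stands in for LLaVA and the generator stands in for whatever image-to-image model is used.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical interfaces (assumptions, not any real library's API):
# a captioner maps an image to a textual description, and a generator
# maps (image, prompt, strength) to a new image.
Captioner = Callable[[Any], str]
Generator = Callable[[Any, str, float], Any]


@dataclass
class GenerationResult:
    prompt: str  # the description produced by the captioner
    image: Any   # the output of the image-to-image generator


def caption_guided_img2img(
    image: Any,
    captioner: Captioner,
    generator: Generator,
    strength: float = 0.5,
) -> GenerationResult:
    """Describe the input image, then condition generation on both
    the original image and that description."""
    # `strength` is an assumed knob for how far the output may drift
    # from the input; the abstract does not specify such a parameter.
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")

    # Step 1: the captioner (LLaVA in the paper) describes the image.
    prompt = captioner(image).strip()
    if not prompt:
        raise ValueError("captioner returned an empty description")

    # Step 2: the prompt and the original image are passed together
    # to the image-to-image generator.
    output = generator(image, prompt, strength)
    return GenerationResult(prompt=prompt, image=output)
```

Passing the captioner and generator as plain callables keeps the sketch independent of any particular model; in practice each would wrap a concrete model call.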