Large-scale text-to-image generative models have made impressive strides and can synthesize a vast array of high-quality images. However, adapting these models for artistic image editing presents two significant challenges. First, users struggle to craft textual prompts that describe the visual elements of the input image in sufficient detail. Second, when prevalent models modify specific regions, they frequently disrupt the overall artistic style, making it difficult to produce cohesive and aesthetically unified artworks. To overcome these obstacles, we build CreativeSynth, a unified framework based on a diffusion model that can coordinate multimodal inputs and perform multiple tasks in artistic image generation. By integrating multimodal features with customized attention mechanisms, CreativeSynth imports real-world semantic content into the domain of art through inversion and real-time style transfer. This allows precise manipulation of image style and content while preserving the original model parameters. Qualitative and quantitative evaluations show that CreativeSynth improves the fidelity of artistic images while preserving their innate aesthetic essence. By bridging the gap between generative models and artistic finesse, CreativeSynth becomes a custom digital palette.