Previous text-to-image synthesis algorithms typically rely on explicit textual instructions to generate or manipulate images accurately, but they have difficulty adapting to guidance in the form of coarsely matched text. In this work, we stylize an input image using such coarsely matched text as guidance. To this end, we introduce a new task called text-based style generation and propose a two-stage generative adversarial network: the first stage generates the overall image style from a sentence feature, and the second stage refines the generated style with a synthetic feature produced by a multi-modality style synthesis module. We re-filter one existing dataset and collect a new dataset for the task. Extensive experiments and ablation studies validate our framework. We demonstrate the practical potential of our work through applications such as text-image alignment and story visualization. Our datasets are published at https://www.kaggle.com/datasets/mengyaocui/style-generation.