UM-Text: A Unified Multimodal Model for Image Understanding

With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and the reference image, and thus generate visual text that is stylistically consistent with the image. Previous methods often involve complex steps for specifying the text content and attributes, such as font size, color, and layout, without considering stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing via natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be carefully designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by the VLM according to the input instruction. During training, we propose a regional consistency loss that offers more effective supervision for glyph generation in both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute UM-DATA-200K, a large-scale visual text image dataset covering diverse scenes, for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
