In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in both comprehending and generating visual text. Generating images and text simultaneously typically degrades performance due to the inherent inconsistency between the vision and language modalities. To overcome this challenge, existing approaches resort to modality-specific data for supervised fine-tuning, which necessitates distinct model instances. We propose Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. Slide-LoRA harmonizes the generation of vision and language within a single model instance, thereby facilitating a more unified generative process. Additionally, we develop DetailedTextCaps-100K, a high-quality image caption dataset synthesized with a sophisticated closed-source MLLM, to further enhance visual text generation. Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach. Empowered by Slide-LoRA, TextHarmony achieves performance comparable to modality-specific fine-tuning with only a 2% increase in parameters, and shows average improvements of 2.5% on visual text comprehension tasks and 4.0% on visual text generation tasks. Our work demonstrates the viability of an integrated approach to multimodal generation in the visual text domain and lays a foundation for future research.
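The abstract's description of Slide-LoRA suggests a mixture-of-LoRA-experts design with a gate that blends modality-specific and modality-agnostic adapters. Below is a minimal, hypothetical sketch of that idea; the names (`SlideLoRALayer`, `LoRAExpert`), the three-expert layout, and the soft sequence-level gate are our assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """A standard low-rank adapter: the update B @ A with rank r."""
    def __init__(self, dim_in, dim_out, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim_out, rank))  # zero-init: no-op at start

    def forward(self, x):
        # x: (batch, seq, dim_in) -> (batch, seq, dim_out)
        return x @ self.A.T @ self.B.T

class SlideLoRALayer(nn.Module):
    """Hypothetical Slide-LoRA-style layer: a gate scores the input and
    blends a text-specific and an image-specific expert, while a shared
    (modality-agnostic) expert is always applied on top of the frozen base."""
    def __init__(self, base_linear, rank=8):
        super().__init__()
        self.base = base_linear  # frozen pretrained projection
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.text_expert = LoRAExpert(d_in, d_out, rank)
        self.image_expert = LoRAExpert(d_in, d_out, rank)
        self.shared_expert = LoRAExpert(d_in, d_out, rank)
        self.gate = nn.Linear(d_in, 2)  # scores for the two modality-specific experts

    def forward(self, x):
        # Soft gate over the pooled sequence decides each expert's contribution.
        w = F.softmax(self.gate(x.mean(dim=1)), dim=-1)  # (batch, 2)
        specific = (w[:, 0, None, None] * self.text_expert(x)
                    + w[:, 1, None, None] * self.image_expert(x))
        return self.base(x) + specific + self.shared_expert(x)
```

Per-token or hard top-k gating are equally plausible readings; the essential point the abstract conveys is that modality-specific experts are weighted per input while a modality-agnostic expert keeps a common generation space, which is consistent with the reported ~2% parameter overhead of small-rank adapters.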