This paper presents multimodal markup document models (MarkupDM) that can generate both markup language and images within interleaved multimodal documents. Unlike existing vision-and-language multimodal models, our MarkupDM tackles unique challenges critical to graphic design tasks: generating partial images that contribute to the overall appearance, which often involve transparency and vary in size, and understanding the syntax and semantics of markup languages, which play a fundamental role as a representational format of graphic designs. To address these challenges, we design an image quantizer that tokenizes images of diverse sizes with transparency and modify a code language model to process markup languages and incorporate image modalities. We provide in-depth evaluations of our approach on three graphic design completion tasks: generating missing attribute values, images, and texts in graphic design templates. Results corroborate the effectiveness of our MarkupDM for graphic design tasks. We also discuss its strengths and weaknesses in detail, providing insights for future research on multimodal document generation.