Large Language Models (LLMs) have attracted significant attention for their advances in natural language processing, demonstrating strong capabilities in text comprehension and generation. Yet the simultaneous generation of images with coherent textual narratives remains an evolving frontier. In response, we introduce an interleaved vision-and-language generation technique anchored by the concept of "generative vokens," which act as the bridge for harmonized image-text outputs. Our approach uses a two-stage training strategy focused on description-free multimodal generation, in which training requires no comprehensive descriptions of images. To bolster model integrity, we incorporate classifier-free guidance, which enhances the effectiveness of vokens for image generation. Our model, MiniGPT-5, exhibits substantial improvements over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.
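For readers unfamiliar with classifier-free guidance, the following is a minimal sketch of the general technique, not of MiniGPT-5's implementation. It assumes the common formulation in which a model's conditional and unconditional predictions are combined by extrapolating from the unconditional one toward the conditional one. The function name `cfg_combine`, the plain-list representation of predictions, and the `scale` parameter are illustrative assumptions; how the guidance is applied to vokens is not specified in this abstract.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Combine conditional and unconditional predictions (hypothetical helper).

    Uses the common classifier-free guidance form:
        guided = uncond + scale * (cond - uncond)
    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional prediction, and scale > 1 pushes further toward the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy values standing in for a model's two predictions.
    cond_pred = [1.0, 2.0]
    uncond_pred = [0.0, 1.0]
    print(cfg_combine(cond_pred, uncond_pred, scale=2.0))
```

In a system like the one described, the condition would presumably be derived from the generative vokens, with the unconditional branch obtained by dropping that condition during training; that mapping is an assumption here rather than a detail confirmed by the text.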