Large Language Models (LLMs) have garnered significant attention for their advances in natural language processing, demonstrating strong capabilities in text comprehension and generation. Yet generating images together with coherent textual narratives remains an evolving frontier. In response, we introduce an interleaved vision-and-language generation technique built on the concept of "generative vokens," which act as the bridge for harmonized image-text outputs. Our approach uses a two-stage training strategy focused on description-free multimodal generation, in which training requires no comprehensive descriptions of images. To bolster model integrity, we incorporate classifier-free guidance, which enhances the effectiveness of vokens for image generation. Our model, MiniGPT-5, shows substantial improvement over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.