The emergence of ChatGPT has renewed research interest in generative artificial intelligence (GAI). While people have been amazed by the generated results, they have also noticed the reasoning potential reflected in the generated text. However, this capacity for causal reasoning is currently limited mainly to language generation, as in models like GPT-3. In the visual modality, there is currently no equivalent research. Considering causal reasoning in visual content generation is significant because visual information can be arbitrarily fine-grained. In particular, images can provide more intuitive and specific demonstrations than coarse-grained text for certain reasoning tasks. Hence, we propose a new image generation task called visual question answering with image (VQAI) and establish a dataset of the same name based on the classic \textit{Tom and Jerry} animated series. Additionally, we develop a new paradigm for image generation to tackle the challenges of this task. Finally, we perform extensive experiments and analyses, including visualizations of the generated content and discussions of its potential and limitations. The code and dataset are publicly available for academic and non-commercial use under the CC BY-NC-SA 4.0 license at: https://github.com/IEIT-AGI/MIX-Shannon/blob/main/projects/VQAI/lgd_vqai.md.