Learning interpretable representations of the latent factors that generate data is an important topic for the development of artificial intelligence. Large multimodal models, which have recently risen to prominence, can align images with text to generate answers. In this work, we propose a framework that uses a large multimodal model to comprehensively explain each latent variable in generative models. We further measure the uncertainty of the generated explanations, quantitatively evaluate explanation generation across multiple large multimodal models, and qualitatively visualize the variations of each latent variable to study how the disentanglement achieved by different generative models affects the explanations. Finally, we discuss the explanatory capabilities and limitations of state-of-the-art large multimodal models.