Large Language Models (LLMs) have demonstrated remarkable success on tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Using GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates a model's pronoun disambiguation ability from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally surpassing random guessing. Further error analysis identifies important areas for future research aimed at advancing text-to-image models in their ability to interpret and interact with the complex visual world.