True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and to explain these observations in terms of their underlying causes and potential intentionality. Our analysis uses concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, supplemented with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Fifteen of them are open-weight models spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). The sixteenth is OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive the rhetorical techniques and authorial intentions behind the same misleading visualizations. This enables a comparison between model and expert behavior, revealing where LLMs align with human judgment and where they diverge.
