Data visualization rules, derived from decades of research in design and perception, ensure trustworthy chart communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they can reason about and enforce visualization rules directly. Constraint-based systems such as Draco encode these rules as logical constraints for precise automated checks, but maintaining symbolic encodings requires expert effort, motivating the use of LLMs as flexible rule validators. In this paper, we present the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP). We translated a subset of Draco's constraints into natural-language statements and generated a controlled dataset of 2,000 Vega-Lite specifications annotated with explicit rule violations. LLMs were evaluated on both accuracy in detecting violations and prompt adherence, which measures whether outputs follow the required structured format. Results show that frontier models achieve high adherence (Gemma 3 4B / 27B: 100%, GPT-oss 20B: 98%) and reliably detect common violations (F1 up to 0.82), yet performance drops for subtler perceptual rules (F1 < 0.15 for some categories) and for outputs generated from technical ASP formulations. Translating constraints into natural language improved performance by up to 150% for smaller models. These findings demonstrate the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.
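The two evaluation measures named above, prompt adherence and violation-detection F1, can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the rule names, the JSON-list output format, and the example labels are all assumptions made for demonstration.

```python
import json

# Assumed ground-truth rule violations per Vega-Lite spec (illustrative labels only).
ground_truth = [{"bar_without_zero"}, set(), {"rainbow_colormap"}]

# Hypothetical LLM outputs; adherence here means "a valid JSON list of rule names".
llm_outputs = ['["bar_without_zero"]', "not json", "[]"]

adherent = tp = fp = fn = 0
for truth, raw in zip(ground_truth, llm_outputs):
    try:
        pred = set(json.loads(raw))  # parse the structured output
    except json.JSONDecodeError:
        continue  # non-adherent: required format violated, excluded from F1
    adherent += 1
    tp += len(pred & truth)  # correctly flagged violations
    fp += len(pred - truth)  # spurious flags
    fn += len(truth - pred)  # missed violations

adherence = adherent / len(llm_outputs)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

On the toy data above, two of three outputs parse (adherence 2/3), and the one missed violation lowers recall, illustrating how a model can score perfect adherence on format while still failing to detect subtler rules.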