LLM-Based Visualization Evaluation: How Well Do Literacy-Stratified Personas Approximate Human Judgments?

Evaluating data visualizations across diverse user populations remains a significant methodological challenge in visualization research. We propose Literacy-Stratified LLM Evaluation (LSLE), a theoretically grounded evaluation framework that formalizes a two-stage process. In the first stage, visualization literacy personas are constructed from established frameworks such as the Visualization Literacy Assessment Test (VLAT). In the second stage, large language models (LLMs) are directed to adopt these personas and act as simulated evaluators of visualization artifacts. We ground the framework in an epistemic analysis that characterizes the conditions under which LLM persona simulation may produce plausible proxies for literacy-dependent perception and, critically, the conditions under which it does not, engaging directly with emerging critiques of LLM-as-participant paradigms in the VIS and HCI literature. To test LSLE's boundaries empirically, we benchmark its outputs against openly available human response data from the validation studies of two established instruments, VLAT and BeauVIS. Using the same stimuli and assessment items as the original human studies, we compare LSLE persona responses across literacy strata against published human distributions and against default (non-persona) LLM baselines. Our analysis shows where literacy-stratified personas converge with and diverge from human response patterns, identifying the task types and evaluation dimensions for which persona simulation approximates human variability and those for which it systematically fails. We discuss implications for the responsible use of LLM-assisted evaluation as a complement to empirical methods, and propose boundary conditions under which LSLE may be most appropriate: early-stage design exploration and rapid comparative screening rather than summative evaluation.
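
The two-stage process and the comparison against human distributions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the three strata and their persona wording, the helper names (`build_prompt`, `response_distribution`, `total_variation`), and the choice of total variation distance as the comparison metric. The LLM call itself is left out; responses are passed in as plain lists.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Persona:
    """A hypothetical literacy persona; the wording is illustrative only."""
    stratum: str
    description: str


# Assumed three-level stratification; the abstract does not specify the strata.
PERSONAS = [
    Persona("low", "You rarely read charts and find axes and legends confusing."),
    Persona("medium", "You read common charts comfortably but struggle with unfamiliar ones."),
    Persona("high", "You read and critique a wide range of chart types with ease."),
]


def build_prompt(question: str, options: Sequence[str],
                 persona: Optional[Persona] = None) -> str:
    """Build an evaluator prompt; persona=None yields the non-persona baseline."""
    lines = []
    if persona is not None:
        lines.append(f"Adopt this persona: {persona.description}")
    lines.append(f"Question: {question}")
    lines.extend(f"({i + 1}) {opt}" for i, opt in enumerate(options))
    lines.append("Answer with the number of one option.")
    return "\n".join(lines)


def response_distribution(responses: Sequence[str],
                          options: Sequence[str]) -> Dict[str, float]:
    """Convert a list of chosen options into relative frequencies per option."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    return {opt: counts[opt] / len(responses) for opt in options}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Placeholder values only; these numbers are not taken from any study.
options = ["A", "B", "C", "D"]
simulated = response_distribution(["A", "A", "B", "C"], options)
human = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
print(total_variation(simulated, human))  # 0.25
```

Under these assumptions, a smaller distance for a given stratum and item would indicate closer agreement between persona responses and the human distribution, and running the same comparison with `persona=None` gives the default baseline.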
