We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Traditional evaluation methods either rely on human judgment, which is costly and unscalable, or focus solely on data accuracy, neglecting the effectiveness of visual communication. By employing VQA models, we assess both the quality of data representation and the overall communicative clarity of charts. We conduct experiments on two leading VQA benchmark datasets, ChartQA and PlotQA, with visualizations generated by OpenAI's GPT-3.5 Turbo and Meta's Llama 3.1 70B-Instruct models. Our results indicate that, by VQA performance measures, LLM-generated charts fall short of the accuracy of the original, non-LLM-generated charts. Moreover, while few-shot prompting significantly boosts the accuracy of chart generation, considerable progress remains before LLMs can fully match the precision of human-generated graphs. This underscores the value of our framework, which expedites research by enabling rapid iteration without human annotation, thus accelerating advancements in this field.
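The evaluation loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `answer_question` stands in for whatever chart VQA model is used (here it is a trivial stub so the sketch runs), and `relaxed_match` assumes a ChartQA-style relaxed accuracy in which numeric answers may deviate by up to 5% from the gold value.

```python
def answer_question(chart_image: str, question: str) -> str:
    # Placeholder for a real VQA model call on the rendered chart image.
    lookup = {"What is the highest value?": "42"}
    return lookup.get(question, "unknown")

def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Relaxed accuracy: numeric answers may deviate by `tol` (5% by default);
    non-numeric answers must match exactly (case-insensitive)."""
    try:
        p, g = float(pred), float(gold)
        return (p == 0 and g == 0) or abs(p - g) <= tol * abs(g)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()

def vqa_accuracy(chart_image: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Score one chart as the fraction of ground-truth QA pairs the VQA
    model answers correctly when shown the (LLM-generated) rendering."""
    correct = sum(relaxed_match(answer_question(chart_image, q), a)
                  for q, a in qa_pairs)
    return correct / len(qa_pairs)

# Hypothetical QA pairs drawn from the benchmark's annotations.
qa = [("What is the highest value?", "43"), ("Which bar is tallest?", "east")]
print(vqa_accuracy("llm_chart.png", qa))  # → 0.5
```

Comparing this score between the original benchmark charts and the LLM-generated renderings of the same data yields the accuracy gap reported above.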