Chart question answering (CQA) is a crucial area of Visual Language Understanding. However, the robustness and consistency of current Visual Language Models (VLMs) in this field remain under-explored. This paper evaluates state-of-the-art VLMs on comprehensive datasets developed specifically for this study, which cover diverse question categories and chart formats. We investigate two key aspects: 1) the models' ability to handle varying levels of chart and question complexity, and 2) their robustness across different visual representations of the same underlying data. Our analysis reveals significant performance variations across question and chart types, highlighting both the strengths and the weaknesses of current models. We also identify areas for improvement and propose future research directions toward more robust and reliable CQA systems.