Recent advances in Large Vision-Language Models (LVLMs) have significantly improved performance on image comprehension tasks, such as interpreting formatted charts and rich-content images. Yet Graphical User Interfaces (GUIs) pose a greater challenge due to their structured format and detailed textual information. Existing LVLMs often rely excessively on internal knowledge and neglect image content, resulting in hallucinations and incorrect responses in GUI comprehension. To address these issues, we introduce VGA, a fine-tuned model designed for comprehensive GUI understanding. Our model aims to enhance the interpretation of GUI visual data and reduce hallucinations. We first construct a Visual Question Answering (VQA) dataset of 63.8k high-quality examples with our proposed Referent Method, which ensures that the model's responses depend closely on the visual content within the image. We then design a two-stage fine-tuning method called Foundation and Advanced Comprehension (FAC) to enhance both the model's ability to extract information from image content and its alignment with human intent. Experiments show that our approach improves the model's ability to extract information from images and achieves state-of-the-art results on GUI understanding tasks. Our dataset and fine-tuning script will be released soon.