Recent advances in Large Vision-Language Models (LVLMs) have significantly improved performance on image comprehension tasks, such as interpreting formatted charts and rich-content images. Yet Graphical User Interfaces (GUIs) pose a greater challenge due to their structured format and detailed textual information. Existing LVLMs often rely excessively on internal knowledge and neglect image content, resulting in hallucinations and incorrect responses in GUI comprehension. To address these issues, we introduce VGA, a fine-tuned model designed for comprehensive GUI understanding. Our model aims to enhance the interpretation of GUI visual data and reduce hallucinations. We first construct a Visual Question Answering (VQA) dataset of 63.8k high-quality examples with our proposed Referent Method, which ensures that the model's responses depend closely on the visual content within the image. We then design a two-stage fine-tuning method called Foundation and Advanced Comprehension (FAC) to enhance both the model's ability to extract information from image content and its alignment with human intent. Experiments show that our approach improves the model's ability to extract information from images and achieves state-of-the-art results on GUI understanding tasks. Our dataset and fine-tuning script will be released soon.