In the rapidly evolving landscape of AI research and application, Multimodal Large Language Models (MLLMs) have emerged as a transformative force, adept at interpreting and integrating information from diverse modalities such as text, images, and Graphical User Interfaces (GUIs). Despite these advancements, nuanced interaction with and understanding of GUIs remain a significant challenge, limiting the ability of existing models to raise automation levels. To bridge this gap, this paper presents V-Zen, an innovative MLLM meticulously crafted to revolutionise the domain of GUI understanding and grounding. Equipped with dual-resolution image encoders, V-Zen establishes new benchmarks in efficient grounding and next-action prediction, thereby laying the groundwork for self-operating computer systems. Complementing V-Zen is the GUIDE dataset, an extensive collection of real-world GUI elements and task-based sequences that serves as a catalyst for specialised fine-tuning. The successful integration of V-Zen and GUIDE marks the dawn of a new era in multimodal AI research, opening the door to intelligent, autonomous computing experiences. This paper extends an invitation to the research community to join this journey and shape the future of GUI automation. In the spirit of open science, our code, data, and model will be made publicly available, paving the way for multimodal dialogue scenarios with intricate and precise interactions.
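To make the dual-resolution idea mentioned above concrete, the sketch below encodes a GUI screenshot twice, once downsampled for cheap global layout context and once at native resolution for fine widget-level detail, and fuses the two feature streams. This is a minimal PyTorch sketch under assumed shapes; every name and dimension here (DualResolutionEncoder, the patch sizes, dim=256) is a hypothetical placeholder and not V-Zen's actual architecture, which is detailed in the paper body.

```python
# Minimal sketch of a dual-resolution image encoding path (hypothetical,
# illustrative only; module names and dimensions are not V-Zen's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualResolutionEncoder(nn.Module):
    """Encodes a GUI screenshot at two resolutions and fuses the features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Low-resolution branch: coarse patches, global screen layout.
        self.low_res = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # High-resolution branch: finer patches for small widgets and text.
        self.high_res = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) screenshot, e.g. H = W = 1024.
        low = F.interpolate(image, size=(256, 256), mode="bilinear",
                            align_corners=False)
        low_feat = self.low_res(low).flatten(2).mean(-1)       # (B, dim)
        high_feat = self.high_res(image).flatten(2).mean(-1)   # (B, dim)
        # Concatenate global and fine-grained features, project back to dim.
        return self.fuse(torch.cat([low_feat, high_feat], dim=-1))

# Usage: pooled features for a batch of two 1024x1024 screenshots.
enc = DualResolutionEncoder()
feats = enc(torch.randn(2, 3, 1024, 1024))
print(feats.shape)  # torch.Size([2, 256])
```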