People spend an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but they struggle to understand and interact with GUIs, which limits their potential to increase automation. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By using both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW), advancing the state of the art. The model and code are available at \url{https://github.com/THUDM/CogVLM}.