Mobile task automation is an emerging field that leverages AI to streamline and optimize the execution of routine tasks on mobile devices, thereby enhancing efficiency and productivity. Traditional methods, such as Programming By Demonstration (PBD), are limited by their dependence on predefined tasks and their susceptibility to app updates. Recent advances have used the view hierarchy to collect UI information and employed Large Language Models (LLMs) to enhance task automation. However, view hierarchies are not always accessible and are prone to problems such as missing object descriptions or misaligned structures. This paper introduces VisionTasker, a two-stage framework that combines vision-based UI understanding with LLM task planning for step-by-step mobile task automation. VisionTasker first converts a UI screenshot into a natural-language interpretation using a vision-based UI understanding approach, eliminating the need for view hierarchies. It then adopts a step-by-step task planning method, presenting one interface at a time to the LLM, which identifies the relevant elements within the interface and determines the next action, improving accuracy and practicality. Extensive experiments show that VisionTasker outperforms previous methods, providing effective UI representations across four datasets. Furthermore, in automating 147 real-world tasks on an Android smartphone, VisionTasker outperforms humans on tasks with which they are unfamiliar, and it achieves significant further improvements when integrated with the PBD mechanism. VisionTasker is open-source and available at https://github.com/AkimotoAyako/VisionTasker.