Mobile task automation is an emerging field that leverages AI to streamline and optimize the execution of routine tasks on mobile devices, thereby enhancing efficiency and productivity. Traditional methods, such as Programming By Demonstration (PBD), are limited by their dependence on predefined tasks and their susceptibility to app updates. Recent advances have used the view hierarchy to collect UI information and employed Large Language Models (LLMs) to enhance task automation. However, view hierarchies are not always accessible and are prone to problems such as missing object descriptions or misaligned structures. This paper introduces VisionTasker, a two-stage framework that combines vision-based UI understanding with LLM task planning for step-by-step mobile task automation. VisionTasker first converts a UI screenshot into a natural-language interpretation using a vision-based UI understanding approach, eliminating the need for view hierarchies. It then adopts a step-by-step task planning method, presenting one interface at a time to the LLM, which identifies the relevant elements within the interface and determines the next action, improving accuracy and practicality. Extensive experiments show that VisionTasker outperforms previous methods, providing effective UI representations across four datasets. Furthermore, in automating 147 real-world tasks on an Android smartphone, VisionTasker outperforms humans on tasks with which they are unfamiliar, and it achieves significant further improvements when integrated with the PBD mechanism. VisionTasker is open-source and available at https://github.com/AkimotoAyako/VisionTasker.