This study introduces a framework that automates tasks by interacting with user interfaces (UIs) step by step, in the way a human would work through a problem. The approach first converts UI screenshots into natural-language descriptions using vision-based UI analysis, which avoids the limitations of relying on the view hierarchy. It then proceeds screen by screen, prompting the large language model (LLM) to identify the relevant UI elements and act on them, which improves both precision and functionality. Using the ERNIE Bot LLM, the approach outperforms existing methods: it interprets UIs more accurately across several datasets and efficiently automates a variety of tasks on an Android smartphone, outperforming humans on intricate tasks and significantly improving the programming by demonstration (PBD) process.
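The per-screen loop described in the abstract (describe the screen in text, ask the LLM which element to act on, then act) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `UIElement` record, the bracketed-index prompt format, and all function names are hypothetical, the vision-based detector is assumed to have already produced the element list, and the LLM is represented by a plain callable rather than any real ERNIE Bot API.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class UIElement:
    """Hypothetical record for one element found by vision-based UI analysis."""
    index: int
    kind: str                         # e.g. "button", "text_field"
    label: str                        # visible text or icon caption
    bounds: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def describe_screen(elements: List[UIElement]) -> str:
    """Turn the detected elements into a natural-language screen description."""
    return "\n".join(f'[{e.index}] {e.kind}: "{e.label}"' for e in elements)


def build_prompt(task: str, screen_text: str, history: List[str]) -> str:
    """Assumed prompt layout: task, earlier actions, then the current screen."""
    done = "\n".join(history) if history else "(none)"
    return (
        f"Task: {task}\n"
        f"Actions so far:\n{done}\n"
        f"Current screen:\n{screen_text}\n"
        "Reply with the element to tap, e.g. [3]."
    )


def parse_choice(reply: str, elements: List[UIElement]) -> Optional[UIElement]:
    """Extract the first bracketed index from the reply and look it up."""
    match = re.search(r"\[(\d+)\]", reply)
    if match is None:
        return None
    chosen = int(match.group(1))
    return next((e for e in elements if e.index == chosen), None)


def run_step(
    task: str,
    elements: List[UIElement],
    history: List[str],
    llm: Callable[[str], str],
) -> Optional[UIElement]:
    """One screen of the loop: describe, ask the LLM, record the chosen action."""
    prompt = build_prompt(task, describe_screen(elements), history)
    target = parse_choice(llm(prompt), elements)
    if target is not None:
        # A real agent would tap inside target.bounds here; this sketch only logs.
        history.append(f'tap [{target.index}] {target.kind} "{target.label}"')
    return target


def fake_llm(prompt: str) -> str:
    """Stand-in for the model so the sketch runs without network access."""
    return "I would tap [1]."
```

Passing the model in as a callable keeps the sketch independent of any particular LLM client; a real agent would replace `fake_llm` with a call to the chosen model and repeat `run_step` on each new screenshot until the task completes.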