GUI Process Automation (GPA) is a lightweight yet general vision-based approach to Robotic Process Automation (RPA) that enables fast, stable process replay from a single demonstration. To address the fragility of traditional RPA and the non-determinism of current vision-language-model-based GUI agents, GPA offers three core benefits: (1) robustness, via Sequential Monte Carlo-based localization that handles rescaling and detection uncertainty; (2) determinism and reliability, safeguarded by readiness calibration; and (3) privacy, through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. GPA can also be used as an MCP/CLI tool by other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles GUI execution. In a pilot experiment comparing GPA with Gemini 3 Pro (with CUA tools), GPA achieved a higher success rate and executed 10 times faster on long-horizon GUI tasks.