With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception in order to ground task instructions into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling swipe interactions, preventing them from accurately reproducing human-like swipe behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose SwipeGen, an automated pipeline that synthesizes human-like swipe interactions through GUI exploration. Building on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose GUISwiper, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that GUISwiper achieves a swipe execution accuracy of 69.07%, a 214% improvement over existing vision-language model (VLM) baselines.
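To make the idea of "quantifiable dimensions" concrete, the sketch below shows one plausible parameterization of a swipe gesture. The specific dimensions chosen here (start point, end point, duration, and derived length, direction, and speed) are illustrative assumptions for exposition; the abstract does not enumerate the paper's actual dimension set.

```python
from dataclasses import dataclass
import math

@dataclass
class Swipe:
    # Hypothetical quantifiable dimensions of a swipe gesture;
    # the paper's actual parameterization may differ.
    start: tuple[float, float]   # touch-down point (x, y) in screen pixels
    end: tuple[float, float]     # touch-up point (x, y) in screen pixels
    duration_ms: float           # time from touch-down to touch-up

    @property
    def distance(self) -> float:
        """Euclidean length of the swipe in pixels."""
        dx = self.end[0] - self.start[0]
        dy = self.end[1] - self.start[1]
        return math.hypot(dx, dy)

    @property
    def direction_deg(self) -> float:
        """Swipe direction in degrees: 0 = rightward, 90 = downward."""
        dx = self.end[0] - self.start[0]
        dy = self.end[1] - self.start[1]
        return math.degrees(math.atan2(dy, dx)) % 360

    @property
    def mean_speed(self) -> float:
        """Average speed in pixels per millisecond."""
        return self.distance / self.duration_ms

# Example: a fast upward scroll near the center of a 1080x2400 screen.
swipe = Swipe(start=(540, 1800), end=(540, 600), duration_ms=180)
print(f"distance={swipe.distance:.0f}px, "
      f"direction={swipe.direction_deg:.0f}deg, "
      f"speed={swipe.mean_speed:.2f}px/ms")
```

Under this kind of decomposition, each dimension can be measured from recorded human gestures and resampled to synthesize new, human-like swipes, which is the role the abstract attributes to the SwipeGen pipeline.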