The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, which enhances GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a continuous reward function to incentivize high-precision grounding; 2) a ``Simple Thinking'' reward to balance planning with speed and grounding accuracy; and 3) a cropping-based resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present decomposed grounding with selection, which dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art grounding performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2, while also exhibiting strong general agent capabilities. For instance, using both our training and inference enhancements yields a 23\% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. The code is available at https://github.com/KDEGroup/UI-AGILE.
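To make two of the ideas above concrete, the following is a minimal sketch, not the implementation from the UI-AGILE repository. It assumes a continuous grounding reward that is zero outside the target box and decays linearly from 1.0 at the box center to 0.5 at its edge, and a decomposition that tiles a screenshot into overlapping crops whose local predictions are mapped back to full-image coordinates. The function names, the linear decay, and the tile and overlap parameters are all hypothetical choices made for illustration; the selection step that picks among per-tile candidates is not modeled here.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), in pixels


def continuous_reward(point: Tuple[float, float], box: Box) -> float:
    """Hypothetical continuous reward: 0 outside the box, 1.0 at the
    center, decaying linearly to 0.5 at the box edge (an assumed form)."""
    x, y = point
    left, top, right, bottom = box
    if right <= left or bottom <= top:
        raise ValueError("box must have positive width and height")
    if not (left <= x <= right and top <= y <= bottom):
        return 0.0
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half_w, half_h = (right - left) / 2, (bottom - top) / 2
    # Normalized Chebyshev distance to the center, in [0, 1] inside the box.
    d = max(abs(x - cx) / half_w, abs(y - cy) / half_h)
    return 1.0 - 0.5 * d


def _starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of tiles along one axis so the whole axis is covered."""
    if length <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # last tile is flush with the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int, overlap: int) -> List[Box]:
    """Split a width x height image into overlapping tiles (row-major)."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    boxes = []
    for top in _starts(height, tile, overlap):
        for left in _starts(width, tile, overlap):
            boxes.append((left, top,
                          min(left + tile, width), min(top + tile, height)))
    return boxes


def to_global(tile_box: Box, local_x: float, local_y: float) -> Tuple[float, float]:
    """Map a point predicted inside a tile back to full-image coordinates."""
    return (tile_box[0] + local_x, tile_box[1] + local_y)
```

In this sketch, a grounding model would be run once per tile, each tile-local prediction would be converted with `to_global`, and a separate selection step would choose the final point among the candidates.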