When Group Relative Policy Optimization (GRPO) is applied to GUI grounding, rollouts are sampled from a single screenshot view. As a result, groups often collapse into all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its bounding box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer that is optimized with an advantage-weighted loss, excluded from the group baseline, and activated only when the model has itself produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises the accuracy of Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
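The two mechanisms described above, target-preserving crops with exact box remapping and group-relative advantages, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_scale` parameter, the uniform crop sampling, the binary reward, and the form of the anchor gate are all assumptions made for illustration, and the advantage formula is the standard mean/std normalization commonly used with GRPO.

```python
import random
from statistics import mean, pstdev


def sample_target_preserving_crop(width, height, box, rng, min_scale=0.5):
    """Sample a crop that fully contains `box` (x0, y0, x1, y1 in pixels).

    Returns (crop, remapped_box), where remapped_box is the same target
    expressed in the crop's coordinate frame. Hypothetical helper; the
    sampling scheme is an assumption, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    # Crop size: at least as large as the target, at most the full screenshot.
    cw = rng.uniform(max(x1 - x0, min_scale * width), width)
    ch = rng.uniform(max(y1 - y0, min_scale * height), height)
    # The crop origin must satisfy left <= x0 and left + cw >= x1,
    # while staying inside the image (same for the vertical axis).
    left = rng.uniform(max(0.0, x1 - cw), min(x0, width - cw))
    top = rng.uniform(max(0.0, y1 - ch), min(y0, height - ch))
    crop = (left, top, left + cw, top + ch)
    # A pure translation, so the box is remapped exactly.
    remapped = (x0 - left, y0 - top, x1 - left, y1 - top)
    return crop, remapped


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchor_is_active(rewards, max_reward=1.0):
    """Assumed gate: enable the oracle anchor only if some rollout in the
    group already reached the maximum reward (self-verification)."""
    return any(r >= max_reward for r in rewards)


if __name__ == "__main__":
    rng = random.Random(0)
    target = (100, 200, 180, 240)  # illustrative target box on a 1920x1080 screenshot
    for _ in range(3):
        print(sample_target_preserving_crop(1920, 1080, target, rng))
    # A uniform group (all successes) carries no learning signal.
    print(group_relative_advantages([1, 1, 1, 1]))
    # A mixed group, as multi-view sampling aims to produce, does.
    print(group_relative_advantages([1, 0, 0, 1]))
```

When every rollout in a group receives the same reward, the normalized advantages are all zero, which is the degenerate case the abstract attributes to single-view sampling. Drawing the group from several crops of the same instance is intended to make mixed outcomes, and therefore non-zero advantages, more likely.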