GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
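
To make the token reweighting in action-aware SFT concrete, the following is a minimal sketch of a role-weighted negative log-likelihood over one target sequence. It is an illustration under stated assumptions, not the GUI-Libra implementation: the abstract does not give the weighting scheme, so the role labels (`reasoning`, `action`, `grounding`), the weight values, and the function name `weighted_token_nll` are all hypothetical.

```python
import math


def weighted_token_nll(token_logprobs, token_roles, role_weights):
    """Weighted average negative log-likelihood over one target sequence.

    token_logprobs: log-probability the model assigns to each target token.
    token_roles:    a role label per token (hypothetical labels, see below).
    role_weights:   mapping from role label to a non-negative loss weight.
    """
    if len(token_logprobs) != len(token_roles):
        raise ValueError("token_logprobs and token_roles must have equal length")
    weights = [role_weights[role] for role in token_roles]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one token must have a nonzero weight")
    # Normalizing by the total weight keeps the loss scale comparable
    # between reasoning-then-action and direct-action samples.
    weighted_nll = sum(-lp * w for lp, w in zip(token_logprobs, weights))
    return weighted_nll / total_weight


# Assumed weights: action and grounding tokens count more than reasoning
# tokens. These values are illustrative only, not taken from the paper.
ROLE_WEIGHTS = {"reasoning": 1.0, "action": 2.0, "grounding": 4.0}
UNIFORM_WEIGHTS = {"reasoning": 1.0, "action": 1.0, "grounding": 1.0}

# A toy reasoning-then-action sample: two reasoning tokens, one action-type
# token, and one grounding (coordinate) token that the model predicts poorly.
logprobs = [math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.125)]
roles = ["reasoning", "reasoning", "action", "grounding"]

uniform = weighted_token_nll(logprobs, roles, UNIFORM_WEIGHTS)
reweighted = weighted_token_nll(logprobs, roles, ROLE_WEIGHTS)
print(f"uniform loss:    {uniform:.4f}")
print(f"reweighted loss: {reweighted:.4f}")
```

In this toy sample the poorly predicted grounding token dominates the reweighted loss, so gradient signal concentrates on the action and its coordinates rather than on the reasoning text. A direct-action sample is the special case where `roles` contains no `reasoning` entries.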
