Reinforcement Learning (RL) has emerged as a mainstream paradigm for training Mobile GUI Agents, yet it struggles with the temporal credit assignment problem inherent in long-horizon tasks. A primary challenge lies in the trade-off between reward fidelity and density: outcome rewards offer high fidelity but suffer from signal sparsity, while process rewards provide dense supervision but remain prone to bias and reward hacking. To resolve this conflict, we propose the Adaptive Milestone Reward (ADMIRE) mechanism. ADMIRE constructs a verifiable, adaptive reward system by anchoring trajectories to milestones, which are dynamically distilled from successful explorations. Crucially, ADMIRE integrates an asymmetric credit assignment strategy that denoises successful trajectories and scaffolds failed ones. Extensive experiments demonstrate that ADMIRE consistently yields an absolute improvement of over 10% in success rate across different base models on AndroidWorld. Moreover, the method generalizes robustly, achieving strong performance across diverse RL algorithms and heterogeneous environments such as web navigation and embodied tasks.
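To make the milestone-anchored, asymmetric reward shaping concrete, the sketch below shows one plausible instantiation. It is a minimal illustration under stated assumptions, not the paper's actual specification: the function name admire_rewards, the representation of steps as state fingerprints matched against ordered milestones, and all weights are hypothetical.

```python
from typing import List

def admire_rewards(
    trajectory: List[str],   # per-step state fingerprints (illustrative representation)
    milestones: List[str],   # ordered milestones distilled from successful explorations
    success: bool,
    outcome_reward: float = 1.0,
    noise_penalty: float = 0.1,
) -> List[float]:
    """Assign per-step rewards by anchoring a trajectory to milestones.

    Asymmetric credit assignment, sketched: successful trajectories are
    denoised by lightly discounting off-milestone steps; failed trajectories
    are scaffolded by keeping partial credit for milestones they did reach.
    All names and weights here are hypothetical, not the paper's spec.
    """
    rewards = [0.0] * len(trajectory)
    next_ms = 0  # milestones must be matched in order
    for t, state in enumerate(trajectory):
        if next_ms < len(milestones) and state == milestones[next_ms]:
            rewards[t] += 1.0 / len(milestones)  # dense, verifiable milestone credit
            next_ms += 1
        elif success:
            # denoising: penalize detour steps within otherwise successful runs
            rewards[t] -= noise_penalty / max(len(trajectory), 1)
    if success and trajectory:
        rewards[-1] += outcome_reward  # high-fidelity outcome signal at the end
    return rewards

# Example: a failed run that still reached the first of two milestones
print(admire_rewards(["s1", "m1", "s3"], ["m1", "m2"], success=False))
# -> [0.0, 0.5, 0.0]: partial, verifiable credit scaffolds the failure
```

Under this sketch, failed trajectories still receive dense supervision for verified progress, while the outcome reward remains the dominant, high-fidelity signal for successes.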