Mobile GUI agents have shown strong potential for real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon tasks. Building a world model from repeated interactions enables forecasting of action outcomes and supports better decision making for mobile GUI agents. This is challenging because the model must predict post-action states with spatial awareness while remaining efficient enough for practical deployment. In this paper, we propose MobileDreamer, an efficient world-model-based lookahead framework that equips GUI agents with the future imagination provided by a world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent. The textual sketch world model forecasts post-action states by learning to transform screen images into key task-related textual sketches, and employs a novel order-invariant learning strategy to preserve the spatial information of GUI elements. The rollout imagination strategy optimizes action selection by leveraging the predictive capability of the world model. Experiments on Android World show that MobileDreamer achieves state-of-the-art performance, improving task success by 5.25%. World model evaluations further verify that our textual sketch modeling accurately forecasts key GUI elements.
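To make the lookahead idea concrete, the sketch below illustrates how a world model can drive action selection: each candidate action is "imagined" forward by predicting the post-action state, and the action whose rollout scores best against the task goal is chosen. All names here (`predict_sketch`, `score`, the toy state and action set) are hypothetical stand-ins for illustration, not MobileDreamer's actual components.

```python
# Hypothetical sketch of world-model-based rollout lookahead for a GUI agent.
# The world model, scorer, and actions are illustrative stand-ins.

def predict_sketch(state, action):
    # Stand-in world model: forecast a textual sketch of the post-action
    # screen. Here we simply record the action and count goal-relevant taps.
    return {
        "screen": state["screen"] + [action],
        "goal_hits": state["goal_hits"] + (1 if action.startswith("tap") else 0),
    }

def score(state, goal):
    # Stand-in value estimate: how close the imagined state is to the goal.
    return state["goal_hits"]

def lookahead_select(state, actions, goal, depth=2):
    """Pick the action whose imagined rollout scores best."""
    def rollout_value(s, d):
        if d == 0:
            return score(s, goal)
        # Imagine each follow-up action and keep the best continuation.
        return max(rollout_value(predict_sketch(s, a), d - 1) for a in actions)

    return max(actions, key=lambda a: rollout_value(predict_sketch(state, a), depth - 1))

state = {"screen": [], "goal_hits": 0}
best = lookahead_select(state, ["tap:search", "scroll:down", "type:query"], goal="search")
print(best)  # the tap action dominates under this toy scorer
```

In a real agent, `predict_sketch` would be the learned textual sketch world model and `score` a task-progress estimate; the recursive rollout structure is the part this sketch is meant to convey.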