Recent advances in Multimodal Large Language Models (MLLMs) have substantially advanced autonomous agents for Graphical User Interfaces (GUIs). In real-world applications, however, GUI agents often face non-stationary environments, which leads to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents that consists of two components: agentic-Q estimation and step-wise policy optimization. The former optimizes a Q-model that generates step-wise values, each evaluating the contribution of a given action to task completion. The latter takes step-wise samples from state-action trajectories as inputs and optimizes the policy via reinforcement learning with our agentic-Q model. Notably, (i) all state-action trajectories are produced by the policy itself, so data collection costs remain manageable; and (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving strong performance on GUI navigation and grounding benchmarks and even surpassing larger-scale contenders.
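As an illustration only, the two properties highlighted in the abstract (trajectories sampled from the policy itself, and a policy update that never calls the environment) can be sketched in a toy tabular setting. This is not the method of the report: the softmax policy over a handful of discrete states, the hand-written `toy_q_model` standing in for the agentic-Q model, the baseline-corrected policy-gradient update, and every name and hyperparameter below are assumptions made for the example.

```python
import math
import random

N_STATES, N_ACTIONS = 4, 3  # toy sizes, chosen only for illustration


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_q_model(state, action):
    """Hypothetical stand-in for an agentic-Q model.

    Assumption: in each toy state exactly one action (state % N_ACTIONS)
    contributes to task completion and receives step-wise value 1.0.
    """
    return 1.0 if action == state % N_ACTIONS else 0.0


def collect_steps(logits, n_steps, rng):
    """Sample step-wise (state, action) pairs from the current policy."""
    steps = []
    for _ in range(n_steps):
        s = rng.randrange(N_STATES)
        probs = softmax(logits[s])
        a = rng.choices(range(N_ACTIONS), weights=probs)[0]
        steps.append((s, a))
    return steps


def update_policy(logits, steps, q_model, lr=0.5):
    """One policy-gradient pass over stored steps; no environment calls."""
    for s, a in steps:
        probs = softmax(logits[s])
        # Baseline: expected Q under the current policy for this state.
        baseline = sum(p * q_model(s, b) for b, p in enumerate(probs))
        advantage = q_model(s, a) - baseline
        # d log pi(a|s) / d logit_b = one_hot(a)[b] - probs[b]
        for b in range(N_ACTIONS):
            grad = (1.0 if b == a else 0.0) - probs[b]
            logits[s][b] += lr * advantage * grad


def train(n_rounds=200, n_steps=32, seed=0):
    """Alternate on-policy sampling with Q-scored, environment-free updates."""
    rng = random.Random(seed)
    logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_rounds):
        steps = collect_steps(logits, n_steps, rng)
        update_policy(logits, steps, toy_q_model)
    return logits
```

Calling `train()` returns one list of logits per toy state; applying `softmax` to each shows the policy concentrating on the action that the stand-in Q-model scores highest. In a real GUI setting the states would be screenshots and instructions, and both the policy and the Q-model would be MLLMs, none of which this sketch attempts to model.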