Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift relative to the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used during RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves the success rate of UITARS1.5-7B from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: https://github.com/LEON-gittech/Verl_GUI.git