On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their adoption is hindered by the perception of poor performance: even the best models (e.g., Qwen3-VL-235B) score only around 60% on benchmarks like AndroidControl, far from viable for real-world use. Our research reveals that the issue lies not only with the models but also with the benchmarks themselves. We identify notable shortcomings in AndroidControl, including ambiguities and factual errors, which systematically underrate agent capabilities. To address this critical oversight, we enhance AndroidControl into AndroidControl-Curated, a refined version of the benchmark improved through a rigorous purification pipeline. On this enhanced benchmark, state-of-the-art models achieve success rates nearing 75% on complex tasks (a 15% improvement), indicating that on-device GUI agents are closer to practical deployment than previously thought. We also introduce our new SOTA model, Magma-R1-3B, post-trained on just 2.4k curated samples using 60 hours of an H20 GPU (approximately $60). Despite being 200 times smaller in parameters, this model delivers performance comparable to Qwen3-VL-235B. We release both the AndroidControl-Curated benchmark and the Magma-R1 model to the research community, encouraging adoption of this enhanced benchmark to better reflect model capabilities and accelerate the development of robust, on-device virtual assistants.