OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

Le Zhang, Yixiong Xiao, Xinjiang Lu, Jingjia Cao, Yusai Zhao, Jingbo Zhou, Lang An, Zikan Feng, Wanxiang Sha, Yu Shi, Congxi Xiao, Jian Xiong, Yankai Zhang, Hua Wu, Haifeng Wang

Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.
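
As a rough illustration of the second training stage, the sketch below shows the group-relative advantage that gives GRPO its name: several responses are sampled for the same input, and each reward is normalized against the mean and standard deviation of its own group. This is a minimal sketch of the general technique, not the OmegaUse implementation. The function name, the `eps` constant, the use of the population standard deviation, and the binary click-in-target reward are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: the name and eps value are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    # eps keeps the division finite when every sample earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed reward for a grounding step: 1.0 if the predicted click lands
# inside the target element, 0.0 otherwise (four samples for one screenshot).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Samples that beat their group average receive a positive advantage and the rest a negative one, so no separate value model is needed to form a baseline. When all samples in a group earn the same reward, every advantage is zero and that group contributes no learning signal.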
