Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms. Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents were often trained on datasets of simple tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we introduce GUI Odyssey, a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combinations. Leveraging GUI Odyssey, we developed OdysseyAgent, a multimodal cross-app navigation agent, by fine-tuning the Qwen-VL model with a history resampling module. Extensive experiments demonstrate OdysseyAgent's superior accuracy compared to existing models. For instance, on average, OdysseyAgent surpasses fine-tuned Qwen-VL and zero-shot GPT-4V by 1.44\% and 55.49\%, respectively, in in-domain accuracy, and by 2.29\% and 48.14\%, respectively, in out-of-domain accuracy. The dataset and code will be released at \url{https://github.com/OpenGVLab/GUI-Odyssey}.