We introduce UFO, a UI-focused agent that fulfills user requests in applications on Windows OS by harnessing the capabilities of GPT-Vision. UFO employs a dual-agent framework to observe and analyze the graphical user interface (GUI) and control information of Windows applications. This enables the agent to navigate and operate both within a single application and across applications, so it can fulfill requests that span multiple applications. The framework incorporates a control interaction module that grounds actions without human intervention and enables fully automated execution. As a result, UFO turns arduous, time-consuming processes into tasks that can be accomplished solely through natural language commands. We tested UFO on 9 popular Windows applications, covering a variety of scenarios that reflect users' daily usage. The results, drawn from both quantitative metrics and real-case studies, underscore the superior effectiveness of UFO in fulfilling user requests. To the best of our knowledge, UFO is the first UI agent specifically tailored for task completion within the Windows OS environment. The open-source code for UFO is available at https://github.com/microsoft/UFO.