We introduce UFO, a UI-Focused agent that fulfills user requests in applications on Windows OS by harnessing the capabilities of GPT-Vision. UFO employs a dual-agent framework to observe and analyze the graphical user interface (GUI) and control information of Windows applications. This enables the agent to navigate and operate both within a single application and across multiple applications to fulfill a user request. The framework incorporates a control interaction module that grounds actions without human intervention, enabling fully automated execution. As a result, UFO turns arduous, time-consuming processes into tasks that can be accomplished through natural language commands alone. We tested UFO on 9 popular Windows applications, covering a variety of scenarios that reflect users' daily usage. The results, drawn from both quantitative metrics and real-case studies, underscore the superior effectiveness of UFO in fulfilling user requests. To the best of our knowledge, UFO is the first UI agent specifically tailored for task completion within the Windows OS environment. The open-source code for UFO is available at https://github.com/microsoft/UFO.
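To make the dual-agent idea concrete, the following is a minimal sketch, not the UFO implementation. It assumes one agent chooses the application, a second agent chooses the next control, and a control interaction step executes that choice. All names (`App`, `Control`, `Action`, `select_app`, `select_action`, `execute`, `run`) are hypothetical, and simple keyword matching stands in for the GPT-Vision calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Control:
    """A UI control exposed by an application (hypothetical structure)."""
    label: str
    on_click: Callable[[], str]


@dataclass
class App:
    """Stand-in for a Windows application and its controls (assumption)."""
    name: str
    controls: List[Control] = field(default_factory=list)


@dataclass
class Action:
    control_label: str
    finished: bool  # True when no further matching controls remain


def select_app(request: str, apps: Dict[str, App]) -> Optional[App]:
    """First agent: pick the application for the request.
    Keyword matching stands in for a vision-language model call."""
    for name, app in apps.items():
        if name.lower() in request.lower():
            return app
    return None


def select_action(request: str, app: App, history: List[str]) -> Optional[Action]:
    """Second agent: pick the next control to operate.
    Chooses the first unused control whose label appears in the request."""
    remaining = [c for c in app.controls
                 if c.label not in history and c.label.lower() in request.lower()]
    if not remaining:
        return None
    return Action(control_label=remaining[0].label, finished=len(remaining) == 1)


def execute(app: App, action: Action) -> str:
    """Control interaction step: ground the action on a concrete control."""
    for control in app.controls:
        if control.label == action.control_label:
            return control.on_click()
    raise LookupError(f"no control named {action.control_label!r} in {app.name}")


def run(request: str, apps: Dict[str, App], max_steps: int = 10) -> List[str]:
    """Select an app once, then alternate action selection and execution."""
    app = select_app(request, apps)
    if app is None:
        return []
    history: List[str] = []
    results: List[str] = []
    for _ in range(max_steps):
        action = select_action(request, app, history)
        if action is None:
            break
        results.append(execute(app, action))
        history.append(action.control_label)
        if action.finished:
            break
    return results
```

In this sketch the application is selected once per request; handling a request that spans several applications would require returning to `select_app` between steps, which is omitted here for brevity.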