Creating autonomous virtual agents capable of using arbitrary software on any digital device remains a major challenge for artificial intelligence. Two key obstacles hinder progress: insufficient infrastructure for building virtual agents in real-world environments, and the need for in-the-wild evaluation of fundamental agent abilities. To address these obstacles, we introduce AgentStudio, an online, realistic, and multimodal toolkit that covers the entire lifecycle of agent development: environment setup, data collection, agent evaluation, and visualization. The observation and action spaces are highly generic, supporting both function calling and human-computer interfaces. AgentStudio's graphical user interfaces further enhance this versatility by allowing efficient development of datasets and benchmarks in real-world settings. To illustrate, we introduce a visual grounding dataset and a real-world benchmark suite, both created with our graphical interfaces. Furthermore, we present several actionable insights derived from AgentStudio, such as general visual grounding, open-ended tool creation, and learning from videos. We have open-sourced the environments, datasets, benchmarks, and interfaces to promote research towards developing general virtual agents.