UI agents powered by increasingly performant AI promise to eventually use computers the way that people do: by visually interpreting UIs on screen and issuing appropriate actions to control them (e.g., mouse clicks and keyboard entry). While significant progress has been made on computationally interpreting visual UIs and on sequencing steps to complete tasks, controlling UIs is still done through system-specific APIs or VNC connections, which limits the platforms and use cases that can be explored. This paper introduces HIDAgent, an open-source hardware/software toolkit that enables UI agents to operate HID-compatible computing systems by emulating a physical keyboard and mouse. HIDAgent is built from three off-the-shelf components costing less than $30 in total, together with a Python library supporting flexible integration. We validated the HIDAgent toolkit by building five diverse use case prototypes across mobile and desktop platforms. As a hardware device, HIDAgent supports research into new interaction scenarios in which agents are physically separated from the devices they control.