Personal AI agents must increasingly operate across APIs, shells, web surfaces, and desktop GUIs, yet many systems remain tuned to a single interface and offer limited support for user teaching and auditability. We present Syll, an open-source, self-hosted multimodal agent harness that unifies MCP/API tools, CLI execution, and visual GUI control in a modular runtime, enabling agents to coordinate computer use across heterogeneous interfaces while streamlining how users and agents exchange information. At the core of Syll is a bidirectional user-agent interaction layer: users teach procedures through direct demonstration, which Syll compiles into reusable skills, and agent execution is translated back into multimodal evidence (logs, keyframes, and approval checkpoints) for inspection and control. Syll further externalizes memory, skills, routines, and governance as editable local artifacts, supporting straightforward inspection, extension, and downstream development. Our implementation has been validated on production desktop applications including Adobe Photoshop, Adobe Audition, Stardew Valley, and macOS Finder, among others. We report mechanism-oriented studies that validate multimodal routing, teachable GUI replay, and persistent local artifacts. We hope Syll can serve as a practical open-source foundation for personal automation that users can teach, inspect, and continuously extend.