Most web agents operate at the human interface level, observing screenshots or raw DOM trees without application-level access, which limits robustness and action expressiveness. In enterprise settings, however, explicit control of both the frontend and backend is available. We present EmbeWebAgent, a framework for embedding agents directly into existing UIs using lightweight frontend hooks (curated ARIA and URL-based observations, and a per-page function registry exposed via a WebSocket) and a reusable backend workflow that performs reasoning and takes actions. EmbeWebAgent is stack-agnostic (e.g., React or Angular), supports mixed-granularity actions ranging from GUI primitives to higher-level composites, and orchestrates navigation, manipulation, and domain-specific analytics via MCP tools. Our demo shows minimal retrofitting effort and robust multi-step behaviors grounded in a live UI setting. Live Demo: https://youtu.be/Cy06Ljee1JQ
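The per-page function registry described above can be pictured with a minimal sketch. The abstract does not specify a wire format or API, so everything named here is an assumption made for illustration: the `ActionRequest` and `ActionResponse` message shapes, the `PageFunctionRegistry` class, and the `setFilter` page function are all hypothetical. The sketch shows only the dispatch step, where a JSON message from the backend is mapped to a function the page has registered.

```typescript
// Hypothetical message shapes; the paper does not define a wire format.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface ActionRequest {
  id: string;    // correlates a reply with its request
  name: string;  // name of a function registered by the current page
  args: Json[];  // positional arguments for that function
}

interface ActionResponse {
  id: string;
  ok: boolean;
  result?: Json;
  error?: string;
}

type PageFunction = (...args: Json[]) => Json;

// Hypothetical registry: each page registers the functions an agent may call.
class PageFunctionRegistry {
  private readonly functions = new Map<string, PageFunction>();

  register(name: string, fn: PageFunction): void {
    this.functions.set(name, fn);
  }

  // Sorted names of the functions currently exposed to the backend.
  list(): string[] {
    return Array.from(this.functions.keys()).sort();
  }

  // Takes one raw JSON message and returns the JSON reply to send back.
  handle(raw: string): string {
    const reply = (r: ActionResponse): string => JSON.stringify(r);

    let req: Partial<ActionRequest> | null;
    try {
      req = JSON.parse(raw);
    } catch {
      return reply({ id: "", ok: false, error: "invalid JSON" });
    }
    if (req === null || typeof req !== "object" || typeof req.name !== "string") {
      return reply({ id: "", ok: false, error: "malformed request" });
    }

    const id = typeof req.id === "string" ? req.id : "";
    const fn = this.functions.get(req.name);
    if (fn === undefined) {
      return reply({ id, ok: false, error: `unknown function: ${req.name}` });
    }

    const args = Array.isArray(req.args) ? req.args : [];
    try {
      return reply({ id, ok: true, result: fn(...args) });
    } catch (e) {
      return reply({ id, ok: false, error: e instanceof Error ? e.message : String(e) });
    }
  }
}

// Example page: exposes one composite action over some local UI state.
const pageState = { filter: "" };
const registry = new PageFunctionRegistry();
registry.register("setFilter", (value) => {
  pageState.filter = String(value);
  return pageState.filter;
});

const okReply = registry.handle(
  JSON.stringify({ id: "1", name: "setFilter", args: ["region=EU"] })
);
const badReply = registry.handle(
  JSON.stringify({ id: "2", name: "deleteEverything", args: [] })
);
console.log(okReply);
console.log(badReply);
```

In a browser, `handle` could be attached to a standard `WebSocket`, for example `socket.onmessage = (ev) => socket.send(registry.handle(String(ev.data)))`. Keeping dispatch independent of the transport means the registry can be exercised without a live connection, and unknown function names are rejected rather than evaluated.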