The increasing reliance on web interfaces presents many challenges for visually impaired users, underscoring the need for more advanced assistive technologies. This paper introduces WebNav, a voice-controlled web navigation agent that leverages a ReAct-inspired architecture and generative AI to address this need. WebNav comprises a hierarchical structure: a Digital Navigation Module (DIGNAV) for high-level strategic planning, an Assistant Module for translating abstract commands into executable actions, and an Inference Module for low-level interaction. A key component is a dynamic labeling engine, implemented as a browser extension, that generates real-time labels for interactive elements, creating a mapping between voice commands and Document Object Model (DOM) components. Preliminary evaluations show that WebNav outperforms traditional screen readers in response time and task completion accuracy for visually impaired users. Future work will focus on extensive user evaluations, benchmark development, and refining the agent's adaptive capabilities for real-world deployment.