General web-based agents are increasingly essential for interacting with complex web environments, yet they perform poorly in real-world web applications, reaching extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: planning and grounding. Most existing research, however, treats these agents as black boxes and focuses on end-to-end evaluations, which hinders meaningful improvement. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a separate benchmark for each component and identifies the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.
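To make the planning/grounding split concrete, the following is a minimal sketch of how the two components could be scored separately from the end-to-end result. It is an illustration under stated assumptions, not the benchmark protocol used in this work: the `Step` record, its field names, the exact-match comparison, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # All fields are hypothetical names chosen for this illustration.
    gold_plan: str           # reference action description for the step
    gold_element: str        # reference target element id
    pred_plan: str           # action description produced by the planner
    grounded_from_gold: str  # element the grounder picks when given the gold plan
    grounded_from_pred: str  # element the grounder picks when given the predicted plan


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def evaluate(steps):
    """Score planning, grounding, and the full pipeline independently.

    Exact string match is a simplifying assumption; a real benchmark
    would likely need a more tolerant comparison of plans.
    """
    return {
        # Planner in isolation: does the predicted action match the reference?
        "planning": accuracy(s.pred_plan == s.gold_plan for s in steps),
        # Grounder in isolation: given the gold plan, is the right element found?
        "grounding": accuracy(s.grounded_from_gold == s.gold_element for s in steps),
        # Full pipeline: predicted plan fed into the grounder.
        "end_to_end": accuracy(s.grounded_from_pred == s.gold_element for s in steps),
    }


# Toy data invented for this example; these are not results from the paper.
steps = [
    Step("click search", "btn1", "click search", "btn1", "btn1"),
    Step("type query", "inp1", "click search", "inp1", "btn1"),
    Step("select date", "sel1", "select date", "sel1", "sel1"),
    Step("click submit", "btn2", "click cancel", "btn2", "btn3"),
]

scores = evaluate(steps)
print(scores)
```

In this toy data the grounder is always correct when handed the gold plan, so every end-to-end failure traces back to the planner. An end-to-end score alone would not reveal that.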