Web agents powered by Large Language Models (LLMs) have demonstrated remarkable abilities in planning and executing multi-step interactions within complex web-based environments, fulfilling a wide range of web navigation tasks. Despite these advances, the ability of LLM-powered agents to follow sequential user instructions in real-world scenarios has not been fully explored. In this work, we introduce a new task, Conversational Web Navigation, which requires multi-turn interaction with both the user and the environment, and we support it with a purpose-built dataset named Multi-Turn Mind2Web (MT-Mind2Web). To address the limited context length of LLMs and the context dependency of conversational tasks, we further propose a novel framework, self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques. We conduct extensive experiments to benchmark the MT-Mind2Web dataset and to validate the effectiveness of the proposed method.