The rise of powerful multimodal LLMs has enhanced the viability of building web agents that can, with increasing levels of autonomy, assist users in retrieving information and completing tasks across various human-computer interfaces. It is therefore necessary to build challenging benchmarks that span a wide variety of use cases reflecting real-world usage. In this work, we present WebQuest, a multi-page question-answering dataset that requires reasoning across multiple related web pages. In contrast to existing UI benchmarks that focus on multi-step web navigation and task completion, our dataset evaluates information extraction, multimodal retrieval, and composition of information from many web pages. WebQuest includes three question categories: single-screen QA, multi-screen QA, and QA based on navigation traces. We evaluate leading proprietary multimodal models such as GPT-4V, Gemini Flash, and Claude 3, as well as open-source models such as InstructBLIP and PaliGemma, on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. Finally, we investigate inference-time techniques such as Chain-of-Thought prompting to improve model capabilities for multi-screen reasoning.